EDUCATIONAL 
MEASUREMENT 
AND EVALUATION


Jum C. Nunnally 


Professor of Psychology 
Vanderbilt University 




McGraw-Hill Book Company 


New York 
San Francisco 
Toronto 
London 




Educational Measurement and Evaluation 


Copyright © 1964 by McGraw-Hill, Inc. 

All Rights Reserved. Printed in the United States of America. 
This book, or parts thereof, may not be reproduced 

in any form without permission of the publishers. 


Library of Congress Catalog Card Number 63-19767 


47557 


To my children: Jeffry, Russell, Scott, and Kimberly. 
They were kind enough to understand that in order for 
me to write this book I would not be very available as 
a father and pal during a period of one and a half years. 


Preface 


The book is intended as a comprehensive text for 
undergraduate courses in measurement and evalua- 
tion in education curricula. The content of the book 
is specifically chosen to meet the interests and needs 
of future teachers in elementary and secondary schools. 
In order of emphasis, the book concerns (1) general 
principles of measurement and evaluation, (2) the 
construction and use of teacher-made tests, (3) the 
use and interpretation of commercially distributed 
achievement tests, (4) the measurement of intelli- 
gence and special aptitudes, and (5) the measure- 
ment of attitudes, interests, and personality. That 
order of emphasis was chosen after a careful study of 
existing texts and after discussions with subject- 
matter experts. 

A central theme of the book is that tests are help- 
ful only to the extent that they help in making educa- 
tional decisions. Each major type of test is discussed 
with respect to its contribution to particular kinds of 
educational decisions, and the book is filled with many 
concrete examples of how tests can be effectively used 
for that purpose. 

A potential stumbling block in writing a book of this 
kind is the treatment of statistics. No prior knowledge 
of statistics is presumed. In order to understand how 
tests are constructed and effectively used, it is ab- 
solutely essential to understand some statistical con- 
cepts. The author firmly holds the point of view, 
however, that even if the understanding of statistical 
concepts is vital, an intimate knowledge of statistical
techniques is not highly essential. Consequently, the
book strives to create an understanding of those 
statistical concepts that are essential for effective 
measurement and evaluation. Almost no statistical 
formulas are presented in the body of the text. 
Numerous statistical arguments and formulas are 
presented in a generous appendix. Some students may 
want to study these; others will have them available 
when technical problems arise in the application of 
measurement methods to daily problems in the 
classroom. 

Throughout the book an effort is made to spell
out principles rather than just state them. Thus,
rather than say only that “test items should be clearly
worded,” many specific rules are given for clearly
wording test items, and many examples are given of
good and poor test items.

In the book no effort was made to avoid or gloss
over controversial issues. Except for some cases where
the author could not make up his own mind,
controversial issues are met head on and specific
recommendations are given. Of course this means that
future research will prove some (hopefully, not too
many) of the opinions presented here to be incorrect.
It would be a rare instructor who did not disagree 
with some of the author’s interpretations of existing 
data or who did not have different attitudes from 
some of those expressed in this book. It is hoped, 
however, that by clearly indicating areas of contro- 
versy and by taking definite stands, the author will 
have demonstrated to students how critical arguments 
are made in this area of knowledge and how to 
sharpen their facility for evaluating new developments
and ideas concerning educational measurement.

Although throughout the book many examples are
given of particular tests and measurement methods,
the book is not intended to be a dry compendium of
all the good instruments that are available. The
author’s experience has been that nothing bores
students more than to have to wade through page after
page of illustrative test materials and technical
descriptions of particular tests. In Chapter 17 sources
are given for learning about available tests. In
Appendix D short descriptions and evaluations are given
of numerous commercially distributed tests of
achievement, aptitude, and others.

As is customary to say at about this point in the 
preface, in writing this book the author owes a great 
deal to a number of people. Thanks are due to Dr. 
Lawrence Wrightsman for the very helpful advice 
he gave on the selection of content for the book. The 
book owes a great deal to the patience and fine 
secretarial skill of Miss Dorothy Timberlake. Above 
all I owe a debt of gratitude to my wife and children. 
In order to write the book it was necessary to work 
many hours that ordinarily would have been spent 
with my family. We shall consider the time well 
spent if in some small way this book contributes 
to the effectiveness of classroom teachers. 


Jum C. Nunnally 


Contents


Preface


Part I Basic Principles of Measurement and Evaluation

Chapter 1 Why Use Tests?
What Is a Test?
Physical and Mental Measurement
Advantages of Measurement
Measurement in Education

Chapter 2 What Is a Good Test?
The Prediction Function
Assessment Functions
Measurement of Psychological Traits
Relations among Prediction, Assessment, and Trait Measurement

Chapter 3 Scores, Norms, and Statistics
Measures of Average Performance
Measures of Dispersion
Transformations of the Mean and Standard Deviation
Score Distributions
Scores Based on Ranks
Norms
Cautions in the Use of Norms
The Teacher's Use of Norms

Chapter 4 Correlation and Reliability
Correlational Analysis
Reliability of Measurements
Sources of Unreliability
Estimating the Reliability
Increasing the Reliability


Part II Construction and Use of Teacher-made Tests

Chapter 5 Planning the Test
Information Supplied by Classroom Tests
Outline of Objectives
Outline for Eighth-grade Geography
Outline of Objectives for Physics
Special Problems
A Realistic Outlook on Test Construction

Chapter 6 Test Items
Essay and Objective Examinations
Types of Objective Items
Rules for Writing Multiple-choice Items
Construction of Essay Items
Item Analysis

Chapter 7 Scoring, Grading, and Reporting
Scoring Objective Tests
Scoring Essay Items
Evaluation by Grading
How to Evaluate
Reporting Grades
Combining Scores
A Realistic Outlook on Evaluation


Part III Standardized Achievement Tests

Chapter 8 Construction and Use of Standardized Achievement Tests
Construction of Achievement Tests
Major Uses of Achievement Tests

Chapter 9 Comprehensive Achievement Tests
Content of Comprehensive Tests
Principles for Using Comprehensive Achievement Tests

Chapter 10 Achievement Tests for Special Topics
Survey Tests for Special Topics
Diagnostic Achievement Tests
Directions for Diagnosis


Part IV Prediction and Trait Measurement: Human Abilities

Chapter 11 Factors of Intellect
Factor Analysis
Factors of Ability
Verbal Factors
Number Skills
Reasoning Factors
Memory Factors
Spatial Factors
Perceptual Factors
Speed Factors
Factors and the Child
Multifactor Batteries

Chapter 12 Tests of General Ability
The Content of “Intelligence Tests”
Achievement and Aptitude
Individual versus Group Tests
Verbal versus Performance Tests
The Binet Test and Its Followers
The Wechsler Scales
Group Tests of General Ability
General Intelligence Tests for Infants and Preschool Children
The Nature of “General Intelligence”
Using Tests of General Ability in Schools

Chapter 13 Special Abilities
Vision
Audition
Practical Uses for Sensory Tests
Mechanical Aptitude
Clerical and Stenographic Aptitudes
Artistic Aptitudes
Musical Aptitude
Graphic Art

Chapter 14 Creativity
Traits Relating to Creativity
Measuring Creativity
Recognizing and Promoting Creativity in the Classroom


Part V Prediction and Trait Measurement:
Interests, Attitudes, and Personality

Chapter 15 Attitudes and Interests
Interest Inventories
Interests in Daily Activities
Measurement of Attitudes

Chapter 16 Measurement of Personality
Self-report Methods
Projective Techniques
Observation of Behavior
Promising New Approaches


Part VI Development of Testing Programs

Chapter 17 Development of Testing Programs
School-wide Programs
Who Does What
Sources of Information
Some of the Most Important Uses of Tests
A Philosophy of Measurement


Appendixes

Appendix A Major Publishers of Psychological and Educational Tests
Appendix B Proportions of the Area in Various Sections of the
Normal Distribution
Appendix C Statistical Appendix
Appendix D Commercially Distributed Tests
Section 1 Comprehensive Achievement Test Batteries
Section 2 Reading Achievement Tests
Section 3 Group Tests of General Intelligence
Section 4 Interest Inventories
Section 5 Personality and Adjustment Inventories

References
Index

Part I Basic Principles of Measurement and Evaluation


Chapter 1 Why Use Tests?


Mrs. Throckmorton feels both proud and relieved. Her third-grade 
angels have shouted their way out of the building and are meandering 
to their night-time homes. “What’s on the docket for tomorrow? We 
have to finish up the spelling bee which was interrupted by lunch. It’s
too muddy to play out, so we will have to think of something for the 
children to do inside at recess. If Weekly Readers come in tomorrow 
we could start on that, and we still haven't finished the mimeographed 
exercises. Yipes! Tomorrow is the 14th—two hours of achievement 
testing. I wonder if my children will do better this year in number 
skills.” 

Mr. Blair gave his eleventh-grade students back their midsemester 
examinations in American history. “I don’t think they liked it very 
much, but they got what they deserved. Twice during the month I told 
them that they needed to study more.” 

Pete Bronson stayed after class to express great concern about his 
C+ grade in biology. Pete reminded Mr. Murdock, as he frequently 
does all his teachers, that he intends to go to medical school in six 
more years, and he needs good grades. Pete tries to find fault with the 
coverage and grading of the examination, but Mr. Murdock assures 
him that the examination was carefully constructed and graded accord- 
ing to the most exacting of standards. 

Miss Brown is having a rough time of it. It is her first year out of 
teachers college, and she has thirty-two of the toughest customers that 
ever hit second grade. “I get along with most of them most of the time, 
but I don’t understand Jeffrey. He never sits still. Whatever we do, he 
does something else. Even sitting in a corner by himself, he can keep 
the whole class in an uproar. I would rather have a full grown
hippopotamus wandering freely around the room than have Jeffrey in the
class. Maybe if the school psychologist gave him some tests and talked
with the parents, we might find some ways to help Jeffrey.”


Teachers are by far the greatest users and producers of tests. They
swim in a sea of grades, norms, achievement tests, IQs, and school-wide
testing programs. Consequently, the teacher who is unfamiliar
with modern theory and methods of testing is lacking a part
of his calling. Our purposes here in this book are to bring
prospective teachers up to date on test
construction, standardization, administration, scoring, and interpretation.
Before we go into details about how tests are constructed and
used, we should first consider what, if anything, tests can do for the
teacher. This leads to our first principle: Tests are useful only if they
help in making decisions.

The use of a test can be traced to a need to make a decision about
a pupil, about a method of instruction, or about a teacher. The word
“decision” must be construed broadly to mean all the courses of action
that might follow from test scores. If Johnny makes a low score on an
achievement test, the decisions might be to place Johnny in a remedial
section, to talk with Johnny’s parents about his low standing, and to
encourage Johnny to work harder. If the best that Pete Bronson can
do is to make C+ grades in all his courses, he might well make the
decision to choose a vocation after finishing high school rather than
make an uphill fight in college. If on the achievement tests the Wiley
School students come out relatively much lower in arithmetic than they
do in other subject-matter areas, the decision might be to consider new
methods of instruction in that topic. If a student makes very low
grades on all the examinations in American history, the decision might
be to give a failing grade. If it is found that Jeffrey has a very
high IQ, the decision might be to provide Jeffrey with extra material
to occupy his overactive mind. When students make poor scores on
adjustment inventories, the decisions might be to talk with parents,
have students see the school counselor, or consult with a clinical
psychologist.
Sometimes tests are used to help in making numerous decisions. This
is the case with intelligence tests, which are useful in making decisions
about classroom sectioning, handling of adjustment problems,
vocational guidance, and many others.


If a test is any good, it should help in making some kind of decision.
How well a particular test helps in making particular decisions is an
indication of how valid it is. In this book we will talk a great deal
about the kinds of decisions which are aided by the use of tests, the
ways in which decisions are made, and the means which are used to
test the validity of decisions.


Two of the many schools of thought about the effectiveness of tests
are quite wrong. The first holds that tests are both useless and
unnecessary intrusions into the sacred privacy of human lives. The second
holds that psychological tests are infallible gauges of present abilities
and personality characteristics and infallible predictors of future 
success and adjustment. The truth lies between these two extremes. 
Although existing tests have their weaknesses, it is safe to say that tests 
usually work better than subjective judgments. 

What would the teacher use if tests were not available? On what 
would he base grades? He probably would do what was done fifty 
years ago and base his grades on classroom recitations. The difficulty 
with these is that they favor the outgoing, glib, self-confident student 
to the detriment of the less confident, shy, but often more knowledgea- 
ble, student. If tests were not available, how would sectioning be 
accomplished? It would probably be done on the basis of only several 
days of experience with new students. Even though the teacher often 
is a talented observer of students, like any other observer he falls prey 
to typical errors. He has his own idiosyncratic definitions of what 
constitutes superior abilities—ones which would not be completely 
shared by other teachers in the same situation. Being human, it is easy 
for him to confuse likeability with ability and to advance those students 
who fall into favor. 

If the teacher has had only a relatively short period of acquaintance 
with students, much of his judgment is based on the accidental occur- 
rences in that short time. After three or four months of experience with 
students, he can make much better estimates of their abilities, but 
then it is too late. Looking back on his earlier estimates of ability, he 
can see that he made many mistakes. It takes time to judge people, 
and if there is not considerable time available, judgments often are 
quite inaccurate. 

In making judgments about students, either about their progress in
particular units of instruction or about their over-all abilities, it is
difficult to become familiar with all of the class. Out of thirty students,
six or eight will stand out, either because they are so superior or
because they cause so much trouble of one kind or another. The
in-between students are like the background of a painting, never
attracting much attention, never doing anything either so good or so bad as
to receive notice; therefore, it is very difficult for the teacher to make
accurate judgments of them.

Without the use of tests, how would teachers judge the effectiveness 
of new methods of instruction? In this case, say that a new method of 
reading instruction is being employed. After students have undergone 
the new type of instruction, the only resource available would be to 
reach subjective conclusions about how well the new method works. 
Such decisions would be influenced by many things which are un- 
related to the issue. Some people are skeptical of anything new and 
would tend to downgrade new procedures purely because of this.
Other people are joyously optimistic about anything new, whether
newness means improvement or not. If the new method of instruction
is carried out by an unpopular teacher, other teachers are more likely to
regard the results as dubious, while if it is carried out by a
respected teacher, others are likely to approve of the results regardless
of their intrinsic merits.

The purpose of tests is to help take the personal element and
guesswork out of decisions such as those described above. In such
cases, tests have their weaknesses, but they usually are far more
effective than the alternative means in meeting most objectives of
evaluation. Because there is nothing better to employ, it is not sensible to
discuss whether or not tests should be used. Rather, the important
points to discuss are how tests should be used and what procedures
can be employed to make tests more effective.


What Is a Test? 


A test is a standardized situation that provides an individual with a
score. Let us look at the two important words in the definition:
“standardized” and “score.” By standardized is meant that all the procedures
of testing are specified in such a way that all students are
tested with the same questions or problems and in the same way. A
test is standardized if two test administrators working
independently could obtain the same results with the same group of
students. Some of the most essential ingredients of standardization are
(1) all students should answer the same questions; (2) instructions
should be clear, and the same instructions should be given to all
students; (3) no student should be given any advantage not given to
all students; and (4) a predetermined system of scoring should
uniformly be applied to the answers of all students.

Several examples will suffice to show poor standardization. In
administering his own arithmetic test, a teacher fails to tell students that
only thirty minutes’ time will be allowed. With only ten minutes left,
the teacher announces the time limit. Assuming that the test would
last an hour, some students have been working slowly, and they do
not have enough time left to show what they actually know.

In grading an essay examination, the teacher grades all the questions
for one student, then all the questions for the next student, and so on
until all are graded. At first the teacher has very high standards
for grading, but after seeing that none of the papers measure up to his
expectations, he becomes more lenient. Obviously, the grades of
students depend to a considerable extent on the order in which tests
are graded.


An intelligence test is used to help make a decision about the grade
placement of a transfer student. The test indicates that the student has
an IQ well below average. Unfortunately, the norms for the test were 
not carefully constructed. A better standardized test would show the 
student to be of average ability. 

Ideally a test should be standardized to the extent that the testing 
routine can be written down and mailed to Atlanta, Toronto, or 
London; and the testers in those settings would be able to obtain 
results identical to those that would be obtained by the persons who 
originated the test. To the extent to which there are some special 
subjective elements, either in the test administration or in the scoring, 
the test is not completely standardized. Standardization is the essence 
of testing, and without it, it is not proper to use the word “test.” Later 
in the book we will criticize some methods of testing because they 
lack the essential elements of standardization. The need for stand- 
ardization in testing is much the same as the need for standardization 
in the measurement of physical quantities. For measures of length and 
weight, international standards of measurement are kept in Paris, and 
measures the world over are calibrated by these. Although we have 
not yet reached that level of standardization in psychological and 
educational testing, the effort is the same. 

The second important term in the definition of a test is “score.” A 
score means a numerical indication of a student's performance. Why 
do we need a numerical score rather than simply an adjectival descrip- 
tion of how well a student performs, such as “poorly,” “pretty good,” 
“very good,” etc.? Numerical scores are needed because of the preci- 
sion which they provide. Such precision allows us to differentiate be- 
tween the scores of two students, both of whom might be scored as 
“pretty good.” Only when test results are expressed in numerical form 
is it possible to perform mathematical analyses of the results. Without 
such mathematics it would not be possible to establish rules for the use 
of particular tests or to validate the effectiveness of tests in making 
decisions. 

Much of test construction is concerned with our two key words: 
standardization and scoring. To obtain standardization, a carefully 
edited set of problems or questions is used. Instructions for taking the 
test are written out in advance. All necessary materials for taking the 
test are supplied, and account must even be taken of the fact that some 
students are sure to break their pencil points in the middle of the test. 
To maintain standardization, all tests should be given in reasonably 
comparable environments. For example, standardization would be 
broken if it were necessary (for some odd reason) for some students 
to take tests when they were very tired, such as very late in the
evening, or for some students to take tests in a highly distracting
environment, for example in a room where carpenters are noisily
refurbishing the interior. Although common sense would prevent us from
making such gross breaches of standardization, more subtle failures 
to obtain standardized conditions can adversely influence the results 
of tests. A general rule is that a test is standardized to the extent that 
results are repeatable by other persons in other settings. This is the 
same as saying that a test is standardized if it is reliable. In Chapter 4 
we will discuss at some length the nature of test reliability and the 
factors that promote reliability. 
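
The rule that a test is standardized to the extent that its results are repeatable can be illustrated with a small calculation. The sketch below is only an illustration, not a procedure taken from this book: it assumes that two administrators have independently tested the same six students, and it uses the correlation between the two sets of scores as one possible index of agreement. The function name and the scores are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of deviations from the means
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for the same six students, obtained by two
# administrators working independently (assumed data, for illustration).
scores_admin_a = [12, 15, 9, 20, 17, 11]
scores_admin_b = [13, 15, 8, 19, 18, 11]

agreement = pearson_r(scores_admin_a, scores_admin_b)
print(f"correlation between the two administrations: {agreement:.2f}")
```

A correlation near 1 would suggest that the two administrators rank the students in nearly the same way; a low value would point to subjective elements in the administration or the scoring.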


Physical and Mental Measurement 


Why is there need to have a book of this kind on mental measure- 
ment? In what ways does mental measurement differ from the measure- 
ment of lengths of lumber, temperatures of fluids, velocity of atomic 
particles, heights of blood pressure, and other physical measurement? 
In terms of purposes there are no differences. Both in the measurement 
of mental and physical quantities the intention is to assign numbers to 
objects (with mental measurement the objects are humans) according 
to specified rules. What differences there are between mental and 
physical measurements are matters of degree rather than matters of 
kind. One difference that you often find is that mental measures are 
more changeable than physical measures. For example, if you measure 
the length of a piece of lumber now and two years from now, you will 
find some slight change; however, if you measure the intelligence of a 
child now and two years from now, the relative amount of change (in 
terms of practical implications) is likely to be much greater. Some of 
the personality attributes are even more subject to change over time. 
To show that such changes are not peculiar to mental measurement, 
you also find relatively large changes over time in such physical
measurements as blood pressure and heart rate.

One of the major differences between most physical and mental 
measurements is that it is relatively more difficult to interpret some 
mental measurements. Many physical measurements are so easy to 
interpret that the question of interpretation is seldom raised. For
example, if you were told that a table is six feet long, the meaning of
rulers and the measurement of length are readily understood without
further interpretation. Similarly, with blood pressure: its meaning is
relatively clear. This is also the case with some types of mental
measurement. For example, if a student performs poorly on a series of
arithmetic tests, it points to an obvious deficiency. However, with
some types of mental measurements, particularly with measurements
of personality characteristics, the meaning of responses is not at all
certain. For example, if in an ink blot test the student says that a
particular portion of a blot looks like a butterfly, it is not at all clear
what the response indicates about the student. 

Because with some types of mental measures there is a need to 
clarify how results should be interpreted, we speak of “validating” a 
test. By validating a test we determine how useful an instrument is in 
making certain kinds of decisions. Because different kinds of tests are 
intended to help in making different kinds of decisions, a test may be 
useful or valid for one purpose but totally useless or invalid for others. 
Consequently, several different types of validation procedures are 
possible, depending on the nature of the instrument. Chapter 2 will go
into detail on methods of test validation.


Advantages of Measurements 


Although the reader probably already appreciates the need for 
psychological and educational measurement, he may not have con- 
sidered the specific advantages which such measurements provide. 
Suppose that a high school teacher is trying to judge the intelligence 
of a student in order to advise him whether or not to go to college. 
Even though the teacher may be an excellent judge of such matters, 
the best that he can do is to describe the student's intelligence in
rather gross terms such as “very intelligent” or “highly intelligent.” As
was mentioned earlier, one of the advantages of measurements is that
they allow a more precise, numerical description of how much the
student is above or below average.

Mathematical methods cannot be applied to subjective appraisals
such as “very high,” “below average,” etc. Mathematics can be applied
to test scores. In this way it is possible to construct norms, make
statistical predictions of future performance, and establish statistical
rules for making decisions with psychological tests.

Another advantage of measurements over subjective judgments is
that measurements allow us to communicate results to others more
easily. For example, if a meteorologist had no measurement methods,
all that he could report is that on a particular day the wind was “very
strong.” The phrase “very strong” might be interpreted very differently
by different people who hear or read about the results. Because
meteorologists have anemometers, they can report that the wind velocity
on a particular day was 46 miles per hour. In the same way, it is
relatively easy to communicate to others that a student's intelligence
is above 99 per cent of those in his age group.

Another advantage, mentioned earlier, is that measurements help
to take the subjectivity out of evaluating students. A particular teacher
may become very adept at subjectively estimating the capabilities of
students and be able to make wise decisions about them. Unfortunately,
other teachers may not possess the same subjective skills and
may not be able to evaluate students with high accuracy. But if 
standardized measures were used, all teachers could obtain essentially 
the same valid results. 
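
The earlier statement that a student's intelligence is “above 99 per cent of those in his age group” depends on the kind of arithmetic that numerical scores make possible. The sketch below shows one simple way such a figure could be computed. It is an assumption made for illustration only: the function name, the scores, and the rule of counting only scores strictly below the student's own are not taken from this book.

```python
def percent_below(score, group_scores):
    """Percentage of scores in group_scores that fall strictly below score."""
    if not group_scores:
        raise ValueError("group_scores must not be empty")
    below = sum(1 for s in group_scores if s < score)
    return 100.0 * below / len(group_scores)


# Hypothetical test scores for an age group of ten students (assumed data).
age_group = [85, 92, 98, 100, 101, 104, 107, 110, 115, 131]

standing = percent_below(115, age_group)
print(f"a score of 115 is above {standing:.0f} per cent of the group")
```

A statement of this kind can be passed from one teacher to another without the ambiguity of a phrase such as “very high.”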

A final, and often not fully appreciated, advantage of measurements 
is that they frequently are much more economical than are thorough 
subjective evaluations. It would take at most one hour for a student 
to complete a long and thorough spelling test. Although the results 
would not be a perfect indication of the student's spelling ability, it 
would probably take the teacher several months of informal observation
to reach a comparable level of evaluation of the student's ability
in spelling. Similarly, a good intelligence test can be given to high
school students in no more than one hour. To obtain equally valid
indications of the students’ abilities, if ever, it would take even an
excellent teacher some months of careful observation. For the relatively
short amount of time that tests consume, they are far superior to
informal observation and subjective appraisal. Even when teachers have
months to observe students, some traits are measured more effectively
by mental tests. This is the case, for example, with intelligence tests,
which, even though they take relatively short periods of time, are
usually considerably more valid than any subjective estimates of
intelligence. Regardless of what faults they may have, mental tests are
usually cheaper, quicker, and more accurate than subjective appraisals.


Measurement in Education 


Most educators agree that developments in psychological and educational
testing have gone a long way toward taking the guesswork out
of many types of educational decisions. However, there are still many
critics of standardized tests. It is easy to find fault with particular tests,
say a fourth-grade achievement test. Existing tests are far from perfect
because there is much still that we do not know about methods of
testing. Tests must be constructed on a limited budget and in a limited
amount of time. The people who construct them are fallible humans;
consequently, some test products are not as good as they should be.

Some say that classroom tests and achievement tests emphasize only
trivial details, reward the memorizer, and penalize the more creative
student. There is some truth in these and other criticisms, but they are
not reasons for doing away with tests but rather for constructing and
using tests better.


Regardless of how well or how poorly tests are constructed, some
would argue that tests are “undemocratic,” that they give students
feelings of inferiority, force students to be competitive with one
another, and result in an unhealthy “valuing” of the child. In the same
vein, some argue that standardized achievement tests place a
straitjacket on the teacher, causing him to emphasize topics that will
appear on the test and hindering his pursuit of ideas that he considers
more important.

To disavow the use of tests would represent a head-in-the-sand
attitude. Teachers, students, and parents need to know the facts. To
ignore the facts only postpones the day of reckoning and keeps all 
disturbed in the meantime. It is not being democratic, or in the long 
run even kind, to hide the fact that one student has limited abilities or 
that another has special talents. 

We may all be created equal in the sense of human worth, but we 
are not all equal in terms of abilities and personality characteristics. 
It would be a great social waste not to help the talented achieve as 
much as they can and a great social harm to encourage unsuited stu- 
dents to labor in ventures that will bring failure and discouragement. 
Taking a frank look at our abilities and personality characteristics is 
not always pleasant at the moment, but it is necessary for long-range
achievement and happiness.

The real question at issue is “What types of educational decisions
do we want to make about students, teachers, and methods of
instruction?” If we decide to issue report cards which show graded
performance, then tests (usually those composed by the teacher) are the
most effective means available for assigning grades. If a school-wide
program of counseling is to be instituted, then tests of aptitude and
personality are essential. If we want to find new and better methods of
instruction, it is necessary to use tests to determine the differential
effectiveness of the different approaches.

This book will not try to solve the complex problems of philosophy
that underlie American education, but one firm stand will be taken:
In making decisions, the more information, the better. For example, if
something must be done about Jeffrey (the child mentioned earlier
who disrupted the class), test results would constitute one useful type
of information. Should we not consider his high IQ in the same way that
we would take into account poor eyesight or a hearing deficiency? As
another example, suppose that Mr. Murdock must explain to a boy's
parents why he is being required to repeat the seventh grade. Should
Mr. Murdock not consult the results of achievement tests and use these
in explaining the problem to the boy's parents? Because the teacher
cannot construct a perfect spelling test, should he rely instead on a
subjective judgment of students' spelling ability?

Tests are a useful source of information to help in making educational
decisions. The real questions arise when we consider how test results
will be used, which will be discussed throughout the book. Tests are
here to stay, and anyone who seriously advocates their disuse is in the
same class as the person who, in exasperation at modern motoring,
advocates a return to the horse and buggy.


Summary 


Teachers top the list of professional groups who need and use
educational and psychological tests. Not only are they the greatest
consumers of commercially distributed tests of aptitude, achievement,
and others; but also, because of the thousands of tests that teachers
construct for their own students, it is safe to say that they are
responsible for at least 75 per cent of all the tests produced in this
country each year. For these reasons it is very important for teachers
to understand why various types of tests are used and how they can be
used to the most advantage.

It was emphasized that tests are useful only if they help in making
educational decisions: about students, about classroom activities, about
a particular unit of instruction, and about educational practices. To the
extent to which a test actually helps in making certain types of
decisions, it is said to be valid for those purposes. Because the
validity of tests must be established rather than taken for granted,
various types of studies must be undertaken to show that particular
tests are worth using.

A test was defined as a standardized situation that provides each
student with a numerical score. All measurement requires
standardization, and tests are no exception. If tests were not
standardized, the results would depend haphazardly on the persons who
administered and scored them. Then it would not be possible to compare
scores of different students on the same test, the scores of the same
students on different tests, or the scores that students make from year
to year. It is important that results are expressed in numerical form in
order to make precise comparisons among students and to permit
statistical analyses of the results.

The worth of tests is often challenged on two counts. First, it is
frequently said that tests cause an unhealthy stratification of children
according to their abilities and that tests hamper the teacher in trying
to educate children by the methods that he thinks best. These and
other bad side effects of using tests certainly occur on occasion, not
necessarily because the tests are at fault, but rather because unwise
use is sometimes made of them.

The second indictment of tests is that they are not very effective
measures of aptitude, achievement, personality, and other
characteristics which they attempt to measure. It is easy to find some
poor items on even the most carefully constructed test, and it is easy
to find whole tests which are no good. Also, there is still much that we
do not know about educational measurement, and in some instances
admittedly crude measures must be used for want of more valid
instruments. However, these potential and real faults should not be used
as arguments to do away with tests, but rather they should act as spurs
for the construction of better instruments. If tests were not available
for the measurement of daily progress in school, long-range achievement,
aptitude, personality, and other characteristics, the only recourse
would be to the subjective evaluations of teachers and others. Whenever
research comparisons are made between the differential effectiveness
of tests and subjective evaluations, tests prove to be cheaper,
quicker, and more valid.




chapter 


What Is a Good Test? 


Just as horses come in different sizes, colors, and shapes and are useful
for different purposes, tests differ in their appearance and intended
usefulness. In the same sense that it would not be possible to judge a
race horse by the same standards one would use to judge a dray horse, it
is not sensible to apply identical standards to different kinds of
tests. In this chapter we will discuss the major functions served by
tests and the procedures which are used to determine how well tests
serve those functions.


In discussing differences among tests, it is easy to focus on relatively
superficial characteristics. For example, the teacher is likely to be
concerned with whether tests are multiple choice or essay. School
psychologists discuss intelligence tests in terms of their being verbal
or performance. Although these and other such differences in appearance
are important (and will be discussed in later chapters), much more
important is to consider three major types of functions served by
tests. These functions are assessment, prediction, and trait
measurement, each of which will be discussed in detail in this chapter.
To say that a test serves a particular function is the same as saying
that it is intended to be useful in making particular kinds of
decisions.


The prediction function will be discussed first, not because it is more
important, but because it is the easiest to understand and because a
prior discussion of prediction will pave the way for discussing
assessment and trait measurement. If a test serves its function well, it
is said to be “valid,” but if not, it is invalid for that function. It
should be strongly emphasized that a test is valid only for this or that
purpose; no test can be said to be valid in a general sense.
Consequently, it is not sensible to stamp a grade on a test saying
“generally good” or “generally bad”; rather, it is necessary to state
the purpose for which the test is valid.


The Prediction Function 


One important use of tests is to predict how individuals will behave 
in other situations. For example, when tests are used to select students 
for college, the major purpose is to forecast how well students will do 
when they actually go to college. If tests are accurate in these predic- 
tions, then they are valid in that sense. Tests are used to make many 
different predictions in addition to performance in college—perform- 
ance of stenographers, likelihood of recovering from mental illness, 
success of pilots in training, tendency for factory workers to have 
accidents, presence of brain damage, and many others. 

In elementary and secondary schools, predictor tests relate im- 
portantly to the concept of “readiness.” A primary concern at all levels 
of education is the readiness of students for succeeding phases of 
instruction. Whereas traditionally a chronological age of five has been
acceptable for entrance into kindergarten, many children could profit 
from the experience earlier. At the other extreme, even at five years of 
age many children are too immature to successfully participate. Tests 
are valuable in helping to make decisions about the readiness of chil- 
dren for kindergarten. If such tests accurately forecast how well chil- 
dren will get along in kindergarten, then they validly serve a prediction 
function. 

The importance of determining readiness extends from kindergarten 
up through all grades. Intelligence tests are helpful in determining the 
readiness of students to participate in the first grade. Reading-readiness 
tests are intended to predict which students will easily develop reading 
skills and which students will need special help. If ability groupings 
are used in schools, tests often are used to predict how well students 
will perform. 

When a student transfers from one school to another, the records 
that accompany him often are difficult to interpret with respect to the 
new school setting. The two schools may be in quite different localities, 
differ in the over-all levels of ability of their students, and have 
important differences in practices and objectives. Tests often are used 
to predict how well students will perform in new school settings. 

In secondary schools, tests are useful in predicting how well students
will perform in various aspects of the school program. The test results
are helpful in making decisions about the entrance of students into
vocational programs, general courses of study, accelerated courses of
instruction, honors programs, and others. Near the end of high school,
predictor tests are useful in making decisions about vocations and
future schooling.

Regardless of what is being predicted, prediction functions are
judged in the same way. A test serves a particular prediction function
well if it bears a strong statistical relationship to the behavior it is
intended to predict. If a test selects students who subsequently do well
in college, it is valid for that purpose. If a test helps obtain more
stenographers who succeed on the job, then it is valid for that
prediction function. If a test accurately picks out those preschool
children who are in need of special training before being introduced to
ordinary first-grade fare, then it is serving that prediction function
well. In all cases, the prediction function is judged by the strength of
the statistical correspondence, or correlation, between a predictor test
and the behavior to be predicted.

In order to determine how well an instrument serves a prediction
function, it is necessary to have some important behavior to predict.
The thing to be predicted is called the criterion. For the selection of
college students, the criterion would probably be average grades at
the end of four years of study. For the selection of stenographers, the
criterion might be ratings by supervisors of how well stenographers
performed on the job. For the prediction of first-grade performance,
the criterion would be grades or ratings made by teachers. In all cases,
the criterion stands as a measure of the goodness or badness of some
outcome, and the purpose of the predictor is to estimate the outcome.
Tests never do a perfect job of predicting outcomes. Consequently, it
is necessary to measure the degree to which outcomes can be predicted
from tests. Chapter 4 discusses the statistical measures which are
useful in studying the extent to which tests are predictive of
particular criteria.
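
As a rough illustration of what "strength of the statistical
correspondence" means, the sketch below computes a product-moment
correlation between a predictor test and a criterion. The scores and
grade averages are invented for the illustration, and the function name
pearson_r is only a label chosen here; the book's own treatment of these
statistics is in Chapter 4.

```python
from math import sqrt


def pearson_r(predictor, criterion):
    """Product-moment correlation between predictor scores and criterion scores."""
    n = len(predictor)
    if n != len(criterion) or n < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_x = sum(predictor) / n
    mean_y = sum(criterion) / n
    dev_x = [x - mean_x for x in predictor]
    dev_y = [y - mean_y for y in criterion]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    spread_x = sqrt(sum(a * a for a in dev_x))
    spread_y = sqrt(sum(b * b for b in dev_y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / (spread_x * spread_y)


# Hypothetical data: entrance-test scores and later grade averages
# for the same six students (not taken from any real study).
test_scores = [52, 61, 47, 70, 58, 65]
grade_averages = [2.1, 2.9, 1.8, 3.6, 2.5, 3.0]

validity = pearson_r(test_scores, grade_averages)
print(f"correlation between test and criterion: {validity:.2f}")
```

A value near 1 would mean that students keep nearly the same rank order
on the test and on the criterion; a value near 0 would mean the test
tells us little about the criterion.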

How do you recognize a good predictor, say, a good predictor of 
high school grades? Even if you are an expert, you cannot tell for sure 
by looking at the instrument. Only by conducting a study to determine 
how well a test actually predicts a criterion can the validity be deter- 
mined. Sometimes the most unlikely looking test material will prove to 
be predictive of some criterion. Following is an example: 


W#lt°du WIV Lite du 
eRT:xCD eRt:xCD 
&niOSxe ce &niOSxe 
PaSFk-q paSFk-q 


In the above example, the purpose is to place a check mark between
each pair of letter groups if they are identical and not to check if one
group differs from the other. Just looking at this example you are
likely to think that it is only a trivial game. On the face of it nothing
important is being measured. However, research has shown that this
type of test item tends to measure an ability called perceptual speed.
Items of this kind are useful in predicting success in clerical work
where it is necessary to quickly find names that are out of alphabetical
order, spot particular words in written materials, and where other such
chores are required. The test would be useful in advising students about
some commercial courses in high school, and it would be useful in
counseling students about jobs as secretaries and file clerks.
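
The scoring rule for items of this kind can be sketched in a few lines.
The sketch uses only the three legible letter-group pairs from the
example; the function names and the examinee's responses are
hypothetical, and a real perceptual-speed test would also be timed,
which is not modeled here.

```python
def answer_key(pairs):
    """True where the two letter groups are identical (a check mark belongs there)."""
    return [left == right for left, right in pairs]


def score_responses(pairs, checked):
    """Count the pairs on which the examinee's check marks agree with the key."""
    key = answer_key(pairs)
    if len(checked) != len(key):
        raise ValueError("one response is needed for every pair")
    return sum(1 for k, c in zip(key, checked) if k == c)


# Letter-group pairs taken from the example item in the text.
pairs = [
    ("eRT:xCD", "eRt:xCD"),
    ("&niOSxe", "&niOSxe"),
    ("PaSFk-q", "paSFk-q"),
]

# Hypothetical examinee: checks the second and third pairs.
responses = [False, True, True]
score = score_responses(pairs, responses)
print(score, "of", len(pairs), "pairs marked correctly")
```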

An employer might rightly question why he should hire people
because of their ability to quickly recognize similar and dissimilar
groups of letters such as is required in the type of items shown above.
What the employer should be told is that, even if the test may not
appear to be important, previous studies have shown that persons
selected on the basis of the test (and other such tests) work out well
on the job. Six months after workers are on the job, those who scored
high on the test are given better ratings by employers than those who
scored relatively low on the test.

Sometimes tests which are used to make predictions do have a good
deal of intrinsic meaning. Following are some examples.


1. Indolent means most nearly the same as:
a. obvious
b. hard
c. lazy
d. untruthful
2. Up is to down as friendly is to:
a. kind
b. hostile
c. healthy
d. obstinate
3. The square root of 121 is:
a. 242
b. 12.1
c. 17
d. 11
4. Add:
4269
1954
  62
5. Bill picks 14 apples and gives half of these to his mother. He decides
to save 4 apples for his father and to eat the rest. How many apples
did Bill eat?
6. An automobile radiator has a capacity of 16 quarts. Presently it
contains 12 quarts of water and 4 quarts of alcohol. How many quarts of
fluid would have to be drained and replaced with alcohol to make the
mixture half alcohol?


7. The Cathedral of Notre Dame in Paris is an excellent example of: 
a. Georgian design 
b. Gothic design 
c. Byzantine design 
d. Romanesque design 


The reader will recognize these as examples of items measuring vocab- 
ulary, number skills, mathematical reasoning, and general informa- 
tion. Items of these kinds appear on intelligence tests. Such tests 
provide very useful predictive information to help in making decisions 
about the readiness of students for each step in the educational process 
and for special programs of study. Even if some items look important 
and others look unimportant, when prediction functions are being 
studied the real proof of the pudding is in the actual statistical relation- 
ships obtained between the tests and their respective performance 
criteria. 

The criterion is a necessary aspect of any study of a predictor test. 
In many studies the criterion presents a larger problem than obtaining 
good predictors. Two important criteria are school grades and achievement
test scores, both of which reflect the extent to which students
have mastered the intellectual objectives of instruction. Both are very 
sensible criteria, and they are widely used in determining the validity 
of predictor tests. However, many of the important objectives of in- 
struction are not entirely mirrored in grades and achievement test 
scores. Among the other objectives of instruction are to (a) develop

dards, (b) promote social adjustment, and (c) 
arning. Although these criteria are equally important 
as the more intellectual ones, they are much harder to measure.

A good example of the importance of attitudinal (nonintellectual)
criteria is in making decisions about the possible acceleration (double
promotion) of a very bright student. Tests indicate that the student
will have no trouble in mastering the subject matter; but what effects
will it have on the attitudes of the student to be removed from his
classmates, to be pushed into an older age group, and to be singled
out as a “genius” by parents and friends? Presently we have only
approximate methods for measuring attitudinal effects of such decisions.
These methods are discussed at a number of places in the book.

An important point to remember is that when predictor tests are used
in school situations (for example, in estimating readiness), the things
being predicted (the criteria) should be kept in mind. Only if the
predictor tests bear statistical relationships to the criteria are they
of any use. Appearance alone is not good enough. Regardless of how
appealing tests look, they may prove to be very poor predictors when the
relationships are actually studied.


Assessment Functions 


In contrast to the prediction function, the assessment function con- 
cerns a direct measurement of the effectiveness of performance at a 
particular point in time. An outstanding example of the assessment 
function is seen when the teacher tests his class for progress in a unit 
of instruction. When the eighth-grade teacher uses a test in geography, 
the purpose is to directly measure how well students have met the 
objectives of instruction in that topic. The purpose of the test is not 
necessarily to predict anything the student will do in the future, 
although it would be hoped that grades in the geography class would 
relate to future performance. Even if it were sensible to judge class- 
room tests by the prediction function, it would be extremely difficult 
to determine what such tests should predict and even more difficult to 
carry out the studies. It is conceivable that a student could do an 
excellent job in eighth-grade geography yet never take another course 
in geography, never take another course in which geography was even 
indirectly involved, and never work at a job where geography was 
important. 

Another example of an assessment function is the use of standardized 
achievement tests. When standardized achievement tests are given to 
fourth-grade children, the purpose is to directly measure their over-all 
performance up to that time. Whereas it was said that items used on 
predictor tests need bear no obvious relationship to what is to be 
predicted, items on an assessment instrument should be obviously 
related to the performance to be measured. 

What are some of the essential ingredients of a valid assessment? 
First, an assessment is always a sampling of the behaviors that connote 
good or bad performance in a particular situation. On a spelling test 
it is not possible to include all the words that the student might have 
encountered. In a test of number skills, only a relatively small sample 
of all possible problems is included. The notion of sampling is inti- 
mately related to assessments. A good assessment is one that covers a 
representative sample of the important behaviors in a particular 
situation. 
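
The sampling idea can be made concrete with a small sketch. It assumes
a hypothetical pool of 200 spelling words that the teacher has already
judged important and a hypothetical student who can spell 150 of them;
the test is a sample of 40 words, and the proportion spelled correctly
on the sample serves as an estimate of the proportion the student knows
in the whole pool. None of these numbers or names comes from the text.

```python
import random


def sample_test(domain, n_items, seed=0):
    """Draw n_items distinct items from the pool of important content."""
    rng = random.Random(seed)  # fixed seed so the same test can be rebuilt
    return rng.sample(domain, n_items)


def proportion_correct(items, known):
    """Share of the given items that the student answers correctly."""
    return sum(1 for item in items if item in known) / len(items)


# Hypothetical pool of important spelling words and a hypothetical student.
domain = [f"word{i}" for i in range(200)]
known = set(domain[:150])  # the student can spell 75 per cent of the pool

test = sample_test(domain, 40)
estimate = proportion_correct(test, known)
true_level = proportion_correct(domain, known)
print(f"score on the 40-word test: {estimate:.2f}; whole pool: {true_level:.2f}")
```

The test score will rarely equal the student's true level exactly, but
a sample that represents the pool fairly should come close to it.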

A key term used in our definition of a good assessment is that of 
important behavior. The items in a good assessment are not randomly 
drawn from all the things that are seen and heard in the performance 
situation. Hypothetically, one could write down on slips of paper all 
the things that were seen and heard with respect to a unit of instruc- 
tion, including everything in the textbooks and everything that the 
teacher said. The test questions then could be randomly selected from 
the total collection of events. If they were randomly selected, the 
teacher could rightly ask questions such as “Whose picture is on page
96 of the textbook?” “What country were we discussing on April 22?”
“What color dress was the teacher wearing last Tuesday?” Of course,
such incidental facts would not find their way into any good test. The
reason is that a good assessment instrument is restricted to the
important aspects of performance.

How do you determine what the important behaviors are in assess- 
ment situations? This is necessarily a matter of human values. Class- 
room testing involves values—values of the particular teacher, values 
of the school, and values of American education in general. The values 
which are used to determine the sampling of content for an assessment 
originate in some individual or some group that has the responsibility 
for making such decisions. The teacher who tests the students on 
knowledge of geography has the responsibility to evaluate perform- 
ance. He determines what behaviors are relatively good and relatively 
bad, and he tries to tap those behaviors with tests. He includes items 
of particular kinds because he considers them to be important. Others
he excludes because he believes them to be relatively unimportant.

Of course, wherever values exist, differences in values among
individuals also exist. Some of the test materials that one teacher would
consider important would not be considered important by other
teachers. This happens in all types of assessments: in schools, in
industry, in military settings, and in other places. Fortunately, in most
places where assessments are made there is relatively good agreement
about the types of content that should be considered. For example, if
three different teachers compose tests for eighth-grade geography
classes, although there would be some differences in emphasis, in the
main their tests probably would measure much the same thing.

As another example of how assessments necessarily depend on
human values, two teachers might have somewhat different goals for
instruction in arithmetic. One teacher would tend to stress memorization
of arithmetic facts and principles; the other would stress arithmetic
reasoning. These differences in objectives would lead the two teachers
to compose somewhat different examinations for measuring progress in
arithmetic.

Although assessments inevitably rest upon human values, this is not
meant to imply that teachers do or should compose their examinations
without recourse to widely accepted standards. In education some
values are held quite widely, and most teachers try to implement those
in their instruction as well as in their evaluation of students. For
example, most teachers will agree that it is more important for
students to grasp the principles of the subject matter than to memorize
miscellaneous details by rote. Consequently, as far as possible, most
teachers would like to have their examinations relate most prominently
to general principles embodied in the subject matter. In education the
problem of effective assessment is more a problem of implementation
than one of differing values among teachers.

Regardless of how well founded the value system behind classroom 
instruction and behind classroom testing is, adequate implementation 
of the values in the form of effective tests requires special efforts and 
skills. One of the major purposes of this book is to show teachers how 
to apply their efforts most effectively in constructing classroom tests. 

Commercially distributed achievement tests are assessments in the 
same sense that classroom examinations are, and they depend on the 
same standards of validity. Prior to the construction of most batteries 
of achievement tests, groups of educators are consulted about the 
subject matter emphases that should be employed. For example, before 
composing a comprehensive achievement test battery to be employed 
in the eighth grade, groups of educators had to answer such questions
as “How much emphasis should be placed on the social sciences?”
“Should there be items on geography in the test?” “What maximum 
level of mathematical training should be assumed?” Working with 
educators, test constructors are able to arrive at a subject-matter 
coverage which fits the majority's views. Test specialists are then 
available to ensure that items are carefully written. Regardless of the 
elaborateness with which the content is determined and the avail- 
ability of experts for the construction of achievement tests, they do 
not differ in principle from the midsemester examination in tenth- 
grade algebra. 

Suppose someone questioned the validity of the tests that you use 
in your own classroom. How could such arguments be settled? No 
amount of empirical study would completely answer the questions. No 
amount of elaborate statistics would provide the key. Questions of
validity with assessments inevitably must resort to a rational appeal to
the coverage of content areas.

In discussing the validity of particular assessments, some relevant
questions are as follows: “Why did you not include questions on the
Civil War period in your test?” “Why did your test contain so few
items?” “Why did you use a true-false rather than an essay test?” To
the extent to which a teacher can provide sensible answers to questions
like these and others relating to the nature of the content coverage, he
gains assurance that good assessments are being made. If the teacher
has no good reasons for the content coverage or the mode of presentation
of the items, some questions should be raised about the effectiveness
of the examination. Trying to answer such questions will often
show the teacher that he is not doing a very effective job of
implementing his own values about the course or more generally held
values about the particular subject matter.


One way to ensure some validity for an assessment is to explicitly
outline the goals to be implemented in a course of instruction and
then to compose examinations relating to that outline. This will help
demonstrate in a direct way that a broad coverage of important content 


areas is included in examinations. Examples of such outlines will be 
discussed in Chapter 5. 
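
One way to picture the step from an outline to an examination is to
give each goal in the outline a weight and to divide the items of the
examination in proportion to those weights. The sketch below does this
under assumptions made only for illustration: the geography goals, their
weights, and the function name allocate_items are all hypothetical, and
the outlines presented in Chapter 5 may take a different form.

```python
def allocate_items(outline_weights, n_items):
    """Split n_items among the outline goals in proportion to their weights."""
    total = sum(outline_weights.values())
    exact = {goal: n_items * w / total for goal, w in outline_weights.items()}
    counts = {goal: int(share) for goal, share in exact.items()}
    # Hand any items lost to rounding down to the goals with the
    # largest fractional remainders, so the counts add up to n_items.
    leftover = n_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - counts[g], reverse=True)
    for goal in by_remainder[:leftover]:
        counts[goal] += 1
    return counts


# Hypothetical outline for an eighth-grade geography unit;
# a larger weight marks a goal the teacher considers more important.
outline = {
    "map reading": 3,
    "climate": 2,
    "natural resources": 2,
    "population": 1,
}

plan = allocate_items(outline, 20)
for goal, count in plan.items():
    print(f"{goal}: {count} items")
```

A teacher who can show such a plan has a ready answer to the question
of why the examination covers what it covers.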


Measurement of Psychological Traits 


All teachers are familiar with the two test functions which were
discussed previously: prediction and assessment. They are familiar
with intelligence tests and special aptitude tests used to predict success
in various subject-matter areas. Teachers are highly familiar with
assessments, not only in the form of standard achievement tests, but
through the many tests which they compose for their own students.
There is another important test function, one with which teachers
usually are not very familiar. This is the measurement of psychological
traits (or “constructs”) to be used in educational research. For
example, psychologists have many theories about the construct of anxiety
and perform many experiments relating to anxiety. One theory is that
a moderate amount of anxiety facilitates certain kinds of learning. In
order to perform experiments to test such a theory, it is necessary to
measure relative amounts of anxiety in students. Many procedures
have been suggested for measuring anxiety, including physiological
measures such as electrical resistance of the skin, personality
inventories, and others. But how do you tell if any or all of these
actually measure anxiety rather than something else?

Can the proposed measures of anxiety be validated by the same
standards applied to predictor tests? No, the purpose of the test is not
to predict something else but to measure anxiety then and there. There
is no single criterion that can be used to test the test. Can the test of
anxiety be validated as an assessment in the same sense in which
achievement tests and classroom examinations are judged? No, that
standard cannot be applied either. There is no obvious content to be
outlined and represented in an examination in the way that this can
be done for a course in geography or algebra. The test items do not
obviously measure anxiety, and some proof must be given.

Another type of test that would need construct validation would be
a measure of social adjustment. In the test, students are asked
questions about how well they get along with fellow students. How do we
know if the test actually measures social adjustment? Here again it
should be obvious that validity cannot be determined in the same way
that it is determined with a predictor test or with achievement tests.
Something else is required.


As another example, the following item is used on an inventory to 
measure dominance versus submission (1). 


Someone tries to push ahead of you in line. You have been waiting for
some time, and can’t wait much longer. Suppose the intruder is the same
sex as yourself, do you usually:

Remonstrate with the intruder ____
Call the attention of the man at the ticket window ____
“Look daggers” at the intruder or make clearly audible comments ____
Decide not to wait and go away ____
Do nothing ____


It would be easy to list hundreds of such tests which are intended to 
measure psychological traits. Here only self-report inventory type items 
were shown. In practice, many other types of test materials are used in 
the attempt to measure psychological traits including measures of 
perception, ink blot tests, behavior in group situations, and others. 

How do you determine the validity of such trait measures? Measures
of psychological constructs gain meaning only after they have been 
applied in many different circumstances. If a new test is a valid 
measure of a psychological construct, there are many experimental 
results to be expected from the measure. For example, a valid measure 
of anxiety should show differences between students diagnosed by 
school psychologists as being anxious and normal students. Scores on 
the test should change when students are in anxiety-provoking situations
such as immediately before an important final examination. Many
more such relationships would be expected. If the new measure fits in 
well with these expectations, if it correlates with those variables where 
correlations are expected and shows differences where differences are 
expected, this increases our faith in the validity of the measure. 

It would more likely be the case that some of these expected results 
would be obtained and others would not be obtained. This shows the 
limits of the validity of the new measure and helps redefine the meas- 
ure. For example, after many such studies it might be found that the 
proposed measure of anxiety does seem to measure anxiety related to 
possible physical harm, such as in athletic contests, but does not 
measure anxiety provoked in social situations, such as when embarrass- 
ing events occur in the classroom. 

For the previously mentioned measure of social adjustment, a num- 
ber of comparisons would help support the construct validity of the 
instrument. One meaningful comparison would be with ratings by 
teachers of the social adjustment of their students. Another comparison
would be with ratings by students of the characteristics of their
classmates. For this purpose a list of questions could be used, such as
“Who starts the most arguments?” and “Who tries most to help others?” If
the test of social adjustment shows strong statistical relations with
these and other indicators of social adjustment, it provides some
validity for the use of the test in subsequent research. In construct
validation, a new measure is tested against those variables and situa- 
tions where everyone will agree that relationships would be expected. 
For example, there is little doubt that clinical cases diagnosed as being 
anxious do, on the average, possess considerably more anxiety than 
unselected groups of normal students. Few individuals would argue 
the assumption that final examinations generate some anxiety in stu- 
dents. Similarly, it is reasonable to expect a measure of social adjust- 
ment to correlate with teachers’ ratings of social adjustment and ratings 
by students of the social characteristics of their classmates. Construct 
validation is performed where there is high agreement that relation- 
ships are to be expected. After an instrument performs in many such 
situations, then it becomes relatively safe to trust the meaning of the 
measure in situations where the result is unsure. For example, after a 
measure of anxiety has shown the expected relations in many different 
experiments, then we can apply it in a new situation where the out- 
come is in doubt. For example, the measure can then be used to test 
the theory that anxiety facilitates the learning of certain kinds of 
simple material.

In essence construct validation consists in weaving a network of
meaningful relations between a new measure and other supposed
measures of the same trait. If such relations hold, the measure then
can be trusted in subsequent research. If such relations do not hold,
subsequent research with the instrument should be held suspect.
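
The "network of relations" can be made concrete with a small computational sketch. The figures below are invented solely for illustration (they are not data from any study), and the product-moment correlation is only one of several indices that might be used. The sketch correlates a hypothetical test of social adjustment with two other supposed indicators of the same trait.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for eight students (invented for illustration only).
adjustment_test = [12, 15, 9, 20, 17, 11, 14, 18]
teacher_rating = [3, 4, 2, 5, 4, 2, 3, 5]      # teacher's rating, 1 to 5
peer_help_votes = [2, 5, 1, 7, 6, 2, 3, 6]     # "Who tries most to help others?"

# Each entry is one strand in the network of expected relations.
network = {
    "teacher ratings": pearson_r(adjustment_test, teacher_rating),
    "peer nominations": pearson_r(adjustment_test, peer_help_votes),
}
for indicator, r in network.items():
    print(f"{indicator}: r = {r:.2f}")
```

No single one of these correlations is decisive. It is the accumulation of many such expected results that lends support to the measure.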


Relations among Prediction, Assessment, and Trait Measurement


It was said that the three functions of measurement (prediction,
assessment, and trait measurement) are validated, respectively, by
the accuracy of specific forecasts, the completeness of coverage in a
content area, and the degree to which expected relationships are
obtained in a variety of situations. It should not be thought that these
functions are entirely unrelated. Here we will describe some of the
ways in which the three interact.

The three functions of measurement are related in the sense that
one instrument may at one time or another serve two or even all three
of these. For example, a classroom examination, say one in ancient
history, could be used to make predictions about future performance.
The midterm examination in ancient history probably would be an
excellent predictor of final grades in ancient history and might also be
predictive of future course work, even though it was not specifically 
constructed for that purpose. 

Achievement test batteries are constructed as assessments, and their 
primary validity depends on the reasonableness of their coverage as 
judged by teachers and other educators. However, achievement tests 
are also excellent predictors of future academic achievement. One of 
the best predictors of performance in college is a comprehensive 
achievement test given at the end of high school. The important con- 
sideration is what function is being exercised for an instrument at a 
particular time, and the instrument must be validated for that func- 
tion. If an achievement test is being used, as it usually is, to provide 
a direct assessment of the progress of pupils, then it must stand as an 
assessment and be judged accordingly. If in addition it is used to make 
predictions about future academic or vocational success, then in those 
instances the achievement test must be validated as a predictor and 
must show strong correlations with future indices of success. 

Although it was said that trait measurement is not the same as the 
prediction function, in a sense predictions are at the heart of the
matter. The difference is that with the prediction function the
effectiveness of an instrument stands or falls on one comparison, that between
the test and the criterion. In construct validation (trait measurement), 
rather than there being one specific prediction to be made, a host of 
predictions is made, no one of which is crucial. Whereas in prediction 
you can state the validity in terms of one statistical index, in construct 
validation there is no one measure that can be applied, such as to say 
that the test is a 70 per cent pure measure of anxiety. 

Whereas it was said that the validity of assessments mainly is deter- 
mined by an appeal to the representativeness of the content coverage, 
other validation procedures also provide useful information. For ex- 
ample, it is helpful for the teacher to inspect the difficulty of test items. 
If the teacher intended a particular item to be relatively easy and, in 
fact, it is found that practically none of the students get the item cor- 
rect, the teacher would certainly have reason to reconsider the item 
and how it is stated in the test. It would be helpful when constructing 
achievement tests to study the amount of improvement which students 
show throughout a course of training. For example, if an achievement 
test in high school mathematics is being developed, it would be inter- 
esting to see how much improvement there is on the test from before 
to after. Obviously if students made no better scores after the course 
of training than before, either the course of instruction or the test was 
no good. Those items on which the most improvement occurs usually 
relate more strongly to the actual content of the course. On some 
types of items a great deal of progress might be shown and on others
relatively little progress would be shown. These differences
provide hints to the test constructors in regard to how to improve
the achievement test.

Although the types of statistical results mentioned above and others
are helpful in constructing assessments, it is important to keep in mind
that these are secondary to the use of human judgment to ensure
the appropriateness of content coverage. Statistical results do not dictate
how assessments are composed but rather they provide clues
about how to improve achievement tests and classroom examinations.
For example, even if students fail to get a particular type of item cor- 
rect on the classroom examination, this does not mean that the items 
necessarily are inappropriate or poorly formulated. It may be the fault
of the students, and the teacher may decide to retain the specific type
of item on future tests and instruct students to do better in the
particular content areas represented. Similarly, it would not be correct to
judge achievement test items purely in terms of the relative amounts 
of change from before to after. If this were the case, then an excellent 
item would be “What color is your textbook?” Many of the students 
will not know the answer before the course gets under way, but nearly 
all of them will know it afterward. Statistical evidence is not the pri- 
mary basis for the construction of assessments, but it provides many 
hints about how assessments can be generated and improved. 
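
The before-and-after comparison of items can be sketched in a few lines. The item names and proportions below are hypothetical, invented only to illustrate the point. The sketch ranks items by the gain in the proportion of students answering correctly, and it shows why the ranking cannot be taken at face value.

```python
# Hypothetical proportions of students answering each item correctly
# before and after the course (invented for illustration only).
items = {
    "solve a linear equation": (0.20, 0.85),
    "factor a quadratic": (0.10, 0.40),
    "What color is your textbook?": (0.05, 0.98),
}

def gain(before, after):
    """Improvement in the proportion of students passing an item."""
    return after - before

# Rank items from the largest gain to the smallest.
ranked = sorted(items, key=lambda name: gain(*items[name]), reverse=True)
for name in ranked:
    before, after = items[name]
    print(f"{name}: {before:.2f} -> {after:.2f} (gain {gain(before, after):+.2f})")
```

The textbook-color item heads the list even though it has nothing to do with the content of the course, which is why judgment about content must come first and the statistics second.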

Whereas it was said that the prediction function does not necessarily 
depend on the instrument looking like it is important, this does not 
necessarily mean that test material always is or should be nonsensical 
in appearance. Although the proof of the pudding is in the statistical
relationships which are obtained, it certainly gives one more initial 
confidence in choosing and studying a predictor if the content ap- 
pears, intuitively, to be importantly related to what is to be predicted. 
Selecting and trying out tests to serve a prediction function should 
never be done randomly. The better you are able to guess initially 
which predictors will work, the quicker the problem is solved.
Consequently, the intuitive appeal of the test content may determine those
instruments which are studied for their actual predictive power in
particular situations.


Summary


It is never safe to accept a new test at face value until some
evidence is obtained to show that it does what it is purported to do. In
the same way that it would be foolish to accept a new medicine as a
cure for hayfever until some evidence is shown that the medicine
actually works, it is unwise to accept so-called measures of intelligence,
fourth-grade achievement, neuroticism, and others until the evidence
is shown. In school settings tests are worthwhile (valid) only if they
help in making educational decisions. Special investigations must be 
undertaken to ensure that particular tests actually serve their intended 
functions well. 

Although the types of decisions in which tests potentially provide 
helpful information are legion, these can be classified into three broad 
classes, or functions. The simplest function to understand is that of 
prediction, in which the worth of the instrument is judged by the ex- 
tent to which it accurately forecasts some important behavior. This 
function is exercised when tests are used to place underage children in 
the first grade, when tests are used to assign children to different levels 
within grades, and when they are used to counsel students about col- 
lege training. Tests used in these ways serve their functions validly 
only if they actually forecast how well students will perform. Validity 
is determined by correlating test scores with criteria of performance 
obtained later. 

The second function which tests serve is that of assessment, which 
concerns the performance of a student in a unit of instruction or the 
over-all progress of a student in school up to a particular point in 
time. Primary examples of assessments are achievement tests and tests 
composed by the teacher to measure the progress of his students. As- 
sessments principally are not validated by empirical studies and statis- 
tical analyses of results, but rather their validation consists of a care- 
ful inspection of the content relative to the comprehensiveness of its 
coverage. 

The third function which tests serve is to measure various psycho- 
logical traits such as anxiety, dogmatism, acquiescence, and others. 
Such tests primarily are useful in educational research. Tests intended 
to measure psychological traits cannot be validated solely by their 
statistical relationships with criteria (as is the case with predictor in- 
struments) nor solely by a study of the content (as with assessments). 
What must be done is to relate such instruments to numerous other 
tests and real-life situations in which their respective traits supposedly 
occur. Essentially this is a “bootstrap” operation in which the worth
of a new measure is determined through many correlations with other 
measures. Although teachers are seldom directly concerned with meas- 
ures of psychological traits, they need to know something about them 
in order to interpret the results of educational research. 


Suggested Additional Readings 


American Educational Research Association and National Council on
Measurements Used in Education, Committee on Test Standards. Technical
recommendations for achievement tests. Washington: National Education
Association, 1955.

American Psychological Association. Technical recommendations for psychological 
tests and diagnostic techniques. Washington: American Psychological Associa- 
tion, 1954. (Also in Psychol. Bull., 1954, 51, No. 2, Part 2.) 

Anastasi, Anne. Psychological testing. (2nd ed.) New York: Macmillan, 1961,
chaps. 6, 7.

Cronbach, L. J. Essentials of psychological testing. (2nd ed.) New York: Harper 
& Row, 1960, chap. 5. 

Cureton, E. E. Validity. In E. F. Lindquist (Ed.), Educational measurement. 
Washington: American Council on Education, 1951, chap. 16. 


chapter


Scores, Norms, and Statistics


One of the great advantages of standardized measures is that they 
supply numerical results—inches of height, number of words correct 
on spelling tests, and IQs. Because of these numerical results, it is 
possible to apply many types of mathematical treatments to test scores. 
Mathematical treatments are very helpful in answering questions like 
(a) How much has the spelling ability of Sam Parks improved from 
the beginning to the end of the fourth grade? (b) How well are stu- 
dents at Wiley School progressing in arithmetic relative to the progress 
shown by students in other schools across the country? (c) How 
much above average in general intelligence is Sarah Russell? (d) Is 
Jack Franklin performing as well in school as his abilities permit? 

Unfortunately, some teachers in training are afraid of mathematics, 
and they fail to learn some elementary principles that would greatly 
facilitate their use and interpretation of test results. This is particularly 
sad because the mathematical procedures required in educational 
measurement are simple indeed. It is not possible to fully comprehend 
how tests are constructed, standardized, used, and interpreted without 
first understanding some simple mathematical principles concerning 
scores, norms, and statistics. 

The need for mathematical analyses of test results can be illustrated 
by a typical event in the day of a student. Susie tells her mother that 
she made 44 in spelling and 18 in arithmetic. The mother is happy 
about the spelling grade but worries about Susie’s progress in arith- 
metic. Of course, the mother does not have sufficient information to 
make such decisions about Susie’s progress. If the spelling test had 
100 words and the teacher expected students to know the majority of 
them, then 44 was a very poor score. If the arithmetic test contained 
only 25 problems, then 18 might have been a very good score. 
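
The point about Susie's scores can be checked with a line or two of arithmetic. The sketch below assumes, as the example supposes, a spelling test of 100 words and an arithmetic test of 25 problems. The function name is merely illustrative.

```python
def percent_correct(raw_score, number_of_items):
    """Express a raw score as a per cent of the items on the test."""
    return 100 * raw_score / number_of_items

spelling = percent_correct(44, 100)    # 44 of 100 words
arithmetic = percent_correct(18, 25)   # 18 of 25 problems

print(f"spelling: {spelling:.0f} per cent correct")
print(f"arithmetic: {arithmetic:.0f} per cent correct")
```

Under these assumptions the "low" arithmetic score of 18 represents 72 per cent correct, and the "high" spelling score of 44 represents only 44 per cent correct.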

The direct results of tests are referred to as raw scores. For example, 
such scores would be the number of correct answers on a true-false
test and the sum of points obtained on an essay test. Raw scores are
seldom directly meaningful until they are compared with standards. 
The first standard is a statistical one, that of how well the group as a 
whole performs. If it can be said that Susie's score is above average, 
or much above average, this offers a basis for interpretation. The
second standard is one of values. Regardless of who is average, above
average, or below average, the teacher might decide that as a group 
the students have done poorly or done well. Indices of group perform- 
ance are called norms. One of the most important ways in which 
teachers indicate their standards and goals of instruction is in the as- 
signment of grades. In assigning grades, teachers are influenced by 
normative standards but are not completely dependent on them. In 
this chapter we will consider normative standards; in Chapter 7 we 
will discuss teachers’ evaluations. 


Measures of Average Performance 


Much of what happens in the classroom depends on what the “aver- 
age” student can and does do. In many ways the average student is 
king, because he sets the pace for the whole class; and it is only by 
comparison with his performance that the extremes of good and poor 
performance can be recognized. Also, the goals of instruction largely 
are judged by how well the average student progresses. For these 
reasons, it is important for teachers to learn some principles regarding 
the calculation and interpretation of measures of average performance. 


The word “average” has several meanings. As a prelude to discussing
them, let us look at some typical raw scores, the scores of eleven
students on an arithmetic and a spelling test.


              Arithmetic    Spelling

Johnny            22           48
Fred              12           52
Mary              14           49
Bill              12           51
Kim               14           55
Susan             14           52
Michael           17           50
Sharon            19           62
Harry             11           56
Patricia          15           52
Eric              20           75


By looking at the two sets of scores, the experienced teacher would
have an approximate idea of how well the group as a whole performed
(the average) and how widely the students varied in performance (the
“dispersion”).
Rather than rely solely on a subjective evaluation of the results, there 
are some simple measures which can be used to index the average per- 
formance and the dispersion of performance. Not only are the meas- 
ures more precise than a subjective evaluation of the results would be, 
but they also have an additional advantage: they greatly facilitate the 
communication of results to other persons. If Mrs. Wilkins wants to 
show other teachers how much better one section of ancient history 
performs than another, one way to do this would be to show the entire 
sets of test results. More to the point, and much easier to understand,
would be to show the precise average scores in the two sections. Simi- 
larly, in communicating test results to students, parents, and others, 
communication is facilitated by saying that a particular score is “above 
average” or “much above average.” Even better is to state the per cent 
of students which Susie surpasses on a particular test. 
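
Stating the per cent of students surpassed takes only a simple count. The sketch below uses the arithmetic scores of the eleven students listed above. The convention adopted (counting only students with strictly lower scores, and dividing by the size of the whole group) is an assumption made for illustration; other conventions for handling ties exist.

```python
# Arithmetic scores of the eleven students listed in this chapter.
arithmetic = {
    "Johnny": 22, "Fred": 12, "Mary": 14, "Bill": 12, "Kim": 14,
    "Susan": 14, "Michael": 17, "Sharon": 19, "Harry": 11,
    "Patricia": 15, "Eric": 20,
}

def percent_surpassed(name, scores):
    """Per cent of the whole group scoring lower than the named student."""
    own = scores[name]
    lower = sum(1 for s in scores.values() if s < own)
    return 100 * lower / len(scores)

sharon = percent_surpassed("Sharon", arithmetic)
print(f"Sharon surpasses {sharon:.0f} per cent of the group")
```

Sharon's score of 19 is higher than eight of the eleven scores, or about 73 per cent of the group.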

Just as there is more than one way to skin a cat, there is more than 
one measure of average performance and more than one measure of 
dispersion in performance. First, we will look at some of the measures 
of average performance and discuss their relative advantages and dis- 
advantages. Later we will discuss measures of dispersion. 

Before discussing measures of average performance, it is important 
to distinguish between score points and scores. Score points are all 
possible numerical values on the test continuum. If, for example, the 
test consists of 100 multiple-choice items, the range of possible score 
points is from zero to 100. Scores are the actual numerical results ob- 
tained from a particular group of students. There are two reasons why 
it is important to distinguish scores from score points. First, with sets 
of test results scores usually are obtained for only some of the possible 
score points. In the example above, even though there may be a score 
point of 100, it might be the case that no student would score at that 
point. When there are many items in the test and only a few students 
are tested, inevitably there are many more score points than are repre- 
sented by actual scores at those points. The second reason why it is 
important to keep the distinction in mind is that score points include 
all numerical values including fractional values, but usually it is the 
case that raw scores are expressible only in whole numbers. In meas- 
uring average performance, it often is found that the average is a frac- 
tional number like 41.26. If, as is usually the case, it is possible only 
to receive scores expressible as whole numbers (e.g., 41 or 42, but 
nothing in between), then the average must be thought of as a score 
point rather than an actual score. Keeping in mind the distinction be- 
tween scores and score points will facilitate an understanding of the 
material in this and following chapters. 




Mode. The measure of average which is the easiest to compute, 
and in many ways the easiest to understand, is the mode. Quite simply, 
the mode is the most frequently occurring score. In the arithmetic test, 
the most frequently occurring score is 14, with three students (Mary, 
Kim, and Susan) achieving that score. The mode is 52 on the spelling 
test. Although the mode is sometimes a useful measure, it has several 
faults, particularly when it is determined on as few students as are in 
most classes. The mode is a relatively “unstable” measure; i.e., it changes 
considerably when different groups are tested. If a teacher used the 
mode as the preferred measure of average and he applied the same 
test to several different classes, he would find rather marked shifts in 
the mode due purely to the happenstance of some brighter pupils 
being in one rather than another class. Even if the same, or a very
similar, test is administered to the same class on two occasions, the
mode is likely to change by relatively large amounts, more so than
the two measures which subsequently will be discussed.

Another fault of the mode is that it is indeterminate when two
scores occur with equal frequency. Suppose that on the arithmetic test
three students had scored 13, and three had scored 14, and no other
score occurred more than twice. Then which is the mode, 13 or 14? We
might cut this Gordian knot by saying that the mode is 13.5. More
likely what would happen would be that scores as divergent as 12 and
16 would be competing as the mode with equally high frequencies.
Then it is relatively meaningless to talk about the mode and not much
help to talk about modes in the plural.

As if these faults were not enough, the mode is a poor starting point
for the development of other statistics, e.g., measures of dispersion.
One standard which is employed in selecting measures is that of
mathematical convenience. It often is much easier to develop additional
mathematical procedures when starting from one measure than from
another. By this standard the mode is a rather poor measure.

For the same reason that it is valuable to know the bad signs as well
as the positive features in purchasing a horse, it is helpful to know
some of the potential faults of statistical measures. Only by
understanding the faults of some measures is it possible to appreciate the
advantages of others.
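
Both the computation of the mode and its indeterminacy can be seen in a short sketch. The first two lists are the arithmetic and spelling scores given earlier in this chapter. The third list is hypothetical, made up to reproduce the situation in which 13 and 14 each occur three times.

```python
from collections import Counter

def modes(scores):
    """Return every score that occurs with the highest frequency."""
    counts = Counter(scores)
    top = max(counts.values())
    return sorted(score for score, count in counts.items() if count == top)

arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]

# Hypothetical scores in which 13 and 14 each occur three times.
tied = [11, 12, 13, 13, 13, 14, 14, 14, 16, 16, 19]

print(modes(arithmetic))  # a single mode
print(modes(spelling))    # a single mode
print(modes(tied))        # two competing modes
```

When the function returns more than one value, there is no single mode to report, which is the difficulty described above.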


Median. Another measure of average (central tendency) is the
score point (unlike the mode, not necessarily an actual score) which
divides the students into two equal groups, where 50 per cent score
higher and 50 per cent score lower. This is called the median. If a
student scores above the median, it can be said that he is in the top 50
per cent. If he scores below the median, he is in the bottom 50 per
cent. The median is easily interpreted and easily communicated to
others.


To determine the median, start by ranking the pupils, writing at the 
top of the page the name of the student who makes the highest score, 
next the second highest score, and so on to the student who makes the 
poorest score. There will usually be many ties; e.g., Johnny will score 
the highest with 26 and thus be ranked 1, and three students will come 
next with identical scores of 25. Rank the ties arbitrarily, giving ranks 
of 2, 3, 4 to the three in whatever order they appear. In this way every 
student will receive a rank, and the ranks will run from first to last, 
the last being ranked 11 or 30, or however many students took the 
test. 

If an odd number of students took the test, the median can be de- 
fined as the score made by the middle person. If there are eleven stu- 
dents and their scores are ranked, the median is the score obtained 
by the student ranked sixth (five students above and five below). If 
there is an even number of students, the median falls (theoretically)
halfway between two scores. If there are ten students and the one 
ranked fifth has a score of 18 and the one ranked sixth has a score of 
17, the median can be stated as 17.5. 
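
The ranking rule can be written out as a brief sketch. The function name is illustrative only. The ten scores in the second example are hypothetical, chosen so that the student ranked fifth has 18 and the student ranked sixth has 17.

```python
def median_by_rank(scores):
    """Median by the ranking rule: the middle score when the number of
    students is odd, halfway between the two middle scores when even."""
    ranked = sorted(scores, reverse=True)   # rank 1 is the highest score
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]               # e.g. the sixth of eleven
    return (ranked[middle - 1] + ranked[middle]) / 2

arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
ten_students = [25, 22, 20, 19, 18, 17, 15, 14, 12, 10]  # hypothetical

print(median_by_rank(arithmetic))
print(median_by_rank(ten_students))
```

For the eleven arithmetic scores the rule gives the score of the student ranked sixth, which is 14. For the ten hypothetical scores it gives 17.5.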

The approach to determining the median described above is some- 
what disrupted by the extent to which there are tied scores at the 
median. To illustrate the difficulty, suppose that five students score 18 
or higher, five score 17, and one student scores 16. Then by the rule 
stated above, the median is said to be 17, but this is rather misleading. 
According to the strict definition of the median, 50 per cent of the stu- 
dents are supposed to score less. In this instance, if a student had a 
score of 17, he would be at the median, which implies that about 50 
per cent of the students score lower. In fact, only one student scores 
lower. This is a very unusual circumstance, but to a lesser degree the
same type of problem occurs in most instances in which the median is 
used. More typically, 47 per cent of the students would score 18 or 
higher, 38 per cent would score 16 or lower, and 15 per cent would 
score 17. By the rule stated above, the median would be said to be 
17, but actually there are more (47 per cent) above than below (38 
per cent). 

When there are tied scores at the median, as there nearly always 
are in practice, the median is not uniquely defined and must be esti- 
mated. One method of estimating the median was stated above. A 
more precise way is to interpolate, i.e., in the example above to obtain 
a hypothetical figure somewhere between 17 and 18, such as 17.2 or 
17.4. A method for so interpolating is discussed in Appendix C-5.
Although it is useful to make such interpolations in large-scale test
development projects, it is seldom worth the effort in determining the
median for classroom examinations. For most practical purposes, the 
rule stated above will suffice. To reiterate, rank the students, and if
an odd number of students is tested, find the middle ranked student
and designate his score as the median; if an even number is tested, 
find who has the lowest score in the top 50 per cent and designate his 
score as the median. According to the rule, the median is 14 on the 
arithmetic test. Of the eleven students, five make scores higher than 
14 and the remaining six make scores of 14 or lower. On the spelling 
test the median is 52, with three students scoring at the median. 
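
For readers curious about interpolation, here is one common way of doing it, offered as a sketch and not necessarily the method of Appendix C-5. It assumes that whole-number scores stand for intervals one point wide (a score of 17 covers 16.5 to 17.5) and that the students tied at a score are spread evenly over that interval. The scores are hypothetical, built to match the earlier illustration in which five students score 18 or higher, five score 17, and one scores 16.

```python
from collections import Counter

def interpolated_median(scores, width=1):
    """Interpolate the median within the score interval that contains
    the middle of the distribution (assumes equally wide intervals)."""
    if not scores:
        raise ValueError("no scores given")
    counts = Counter(scores)
    half = len(scores) / 2
    below = 0                                # students below this interval
    for score in sorted(counts):
        if below + counts[score] >= half:
            lower_limit = score - width / 2
            return lower_limit + (half - below) / counts[score] * width
        below += counts[score]

# Hypothetical: one student at 16, five at 17, five at 18 or higher.
scores = [16, 17, 17, 17, 17, 17, 18, 18, 19, 20, 21]
print(round(interpolated_median(scores), 2))
```

Under these assumptions the interpolated median for the hypothetical scores comes to about 17.4, a little above the value of 17 given by the simple ranking rule.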

Teachers will find the median a useful index of central tendency. 
Its advantages are that it is easy to compute when there are no more 
than twenty or thirty students, and it is relatively easy to explain to 
others. The disadvantages are that, if many students are tested and 
exact estimates are needed, the median is “messy” to compute; and 
applying a standard used with the mode, it does not offer as much 
mathematical convenience as the measure which will be discussed 
next. 

Mean. A measure of central tendency is available which can be 
easily computed and easily understood and can be used in many mathe- 
matical developments. The mean is obtained by adding all the scores 
on a test and dividing the sum by the number of students tested. The 
sum of scores on the arithmetic test is 170. Dividing this by 11 (the
number of students) gives a mean of 15.45. In the same way a mean 
of 54.73 is found for the spelling test. 
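
The computation of the mean is easily verified. The sketch below simply adds the scores listed earlier in the chapter and divides by the number of students.

```python
def mean(scores):
    """Sum of the scores divided by the number of students."""
    return sum(scores) / len(scores)

arithmetic = [22, 12, 14, 12, 14, 14, 17, 19, 11, 15, 20]
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]

print(sum(arithmetic), round(mean(arithmetic), 2))
print(sum(spelling), round(mean(spelling), 2))
```

The sums are 170 and 602, and the means, rounded to two decimal places, are 15.45 and 54.73.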

The mean is the measure that we most frequently think of when the 
word “average” is used. If there are many items on the test and many 
students taking the test (say at least twenty of each), the mean, 
median, and the mode usually are very much the same. In the arith- 
metic test the mean, median, and mode are, respectively, 15.45, 14, and 
14. On the spelling test they are 54.73, 52, and 52.

If the mean has a fault, it is that it is strongly influenced by extreme
scores. The story is told of the census taker who sought to find the
average wealth in a very small town. He found seven persons penniless
and one person worth ten million dollars. By studiously computing the
mean, the census taker came to the conclusion that the “average”
person was a millionaire. In a less dramatic way, an extreme score had a
marked effect on the mean of the spelling scores. Eric’s score of 75 is
so divergent from the others that it pulled the mean upward, making
it 2.73 score points higher than the mode and the median (54.73 as
against 52). Fortunately, such extreme scores are unusual. When they
occur, the median usually is preferable to the mean as a measure of
central tendency.
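
The census taker's predicament can be reproduced with Python's standard statistics module. The town's figures follow the story (seven persons with nothing, one with ten million dollars), and the spelling scores are those listed earlier in the chapter.

```python
import statistics

# The census taker's town: seven penniless persons and one
# person worth ten million dollars.
wealth = [0] * 7 + [10_000_000]
town_mean = statistics.mean(wealth)
town_median = statistics.median(wealth)
print(f"mean wealth: {town_mean:,.0f}   median wealth: {town_median:,.0f}")

# The spelling scores, where Eric's 75 pulls the mean upward.
spelling = [48, 52, 49, 51, 55, 52, 50, 62, 56, 52, 75]
pull = statistics.mean(spelling) - statistics.median(spelling)
print(f"mean exceeds median by {pull:.2f} score points")
```

The mean wealth is 1,250,000 dollars while the median is zero, and on the spelling test the mean stands 2.73 points above the median.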

The mean is the preferred measure of central tendency to be used
in most situations. Where there are very extreme scores or where
expediency is an important consideration, the median can be used
instead.


Stability of the Average. It is easy to overinterpret the mean, 
median, or mode obtained from one group of students. Whereas the 
obtained average is precisely the average for that particular group, 
it is only a rough estimate of the average score to be expected from 
other groups. Regardless of whether students in the particular school 
or the particular class tend to be relatively bright or dull, there is al- 
ways considerable variability. Purely by chance, the level of ability 
tends to fluctuate from class to class, from section to section, and from 
year to year. Consequently, some fluctuation of average test scores is 
to be expected from section to section and from year to year. Such 
fluctuations should not necessarily be taken to mean that either the 
instruction has improved or worsened or that the efforts of students 
have changed. To arrive at such conclusions, it is necessary to perform 
controlled experiments. 
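
Chance fluctuation of this kind can be imitated with a small simulation. Everything in the sketch below is hypothetical: an imaginary pool of 1,000 pupils, an assumed spread of scores, and five classes of 25 pupils each drawn at random. No instruction differs from class to class, yet the class means will not agree exactly.

```python
import random

rng = random.Random(1964)  # fixed seed so the sketch is repeatable

# Hypothetical pool of 1,000 pupils whose scores centre near 15.
population = [round(rng.gauss(15, 3.5)) for _ in range(1000)]

# Draw five classes of 25 pupils each, purely by chance.
class_means = []
for _ in range(5):
    pupils = rng.sample(population, 25)
    class_means.append(sum(pupils) / len(pupils))

population_mean = sum(population) / len(population)
print(f"mean of the whole pool: {population_mean:.2f}")
print("class means:", [round(m, 2) for m in class_means])
```

Differences among the printed class means arise solely from which pupils happened to be drawn, which is why such differences alone say nothing about the quality of instruction.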

Deviation Scores. After the mean is obtained, the performance of 
each student can be described in terms of his relative position above 
or below the mean. On the arithmetic test Eric is 4.55 score points 
above the mean, and Patricia is .45 score points below the mean, which 
indicates that Eric is above average and Patricia is below average. 
Scores stated in reference to the mean are called deviation scores, 
which are obtained by subtracting the mean from each of the raw 
scores. A positive deviation score indicates that the student is above 
average; a negative deviation score indicates that the student is below 
average. The full set of deviation scores on the arithmetic and spelling 
tests (the raw scores of which were listed earlier) are as follows: 


Arithmetic           Spelling
deviation scores     deviation scores
                     -6.73
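
Deviation scores are simple to compute. The sketch below assumes the eleven arithmetic scores shown in the chapter's frequency distribution, listed from highest to lowest rather than by student; `deviation_scores` is a hypothetical helper name.

```python
from statistics import mean

# Arithmetic raw scores, highest to lowest, as in the chapter's
# frequency distribution (the order by student is not assumed).
arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]

def deviation_scores(scores):
    """Subtract the group mean from every raw score."""
    m = mean(scores)
    return [s - m for s in scores]

devs = deviation_scores(arithmetic)
for raw, dev in zip(arithmetic, devs):
    print(f"{raw:>3}  {dev:+.2f}")
```

A raw score of 20 comes out 4.55 points above the mean and a raw score of 15 comes out .45 below it, in agreement with the figures quoted for Eric and Patricia. Deviation scores always sum to zero.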


Measures of Dispersion 


Equally important to understanding measures of average perform- 
ance is to understand measures concerning the spread, or dispersion, 
of scores about the average. In neighboring schools two fourth-grade
classes could show the same average performance on an achievement
test yet differ importantly in their dispersions of scores. In one class 
scores could be packed tightly about the mean with no very high or 
low scores. In the other, scores could range from those to be expected 
of students several grades higher to those of students who are func- 
tioning at the second-grade level. Of course, such differences in dis- 
persion would have many implications for sectioning of students ac- 
cording to levels of ability and many implications for daily classroom 
practices. 

Before we can meaningfully interpret particular deviation scores, 
we must learn how widely scores are scattered above and below the 
mean. A deviation score of 2.00 would represent superior performance 
if all the scores are closely packed about the mean. But if there are 
deviation scores as high as 100 and as low as -100, a deviation score
of 2.00 would indicate near average performance. Consequently, we
need a measure of the spread, or dispersion, of scores about the mean
to help interpret particular deviations. As was true of the “average,”
there are various measures of dispersion that can be used. Two of the 
most useful ones will be discussed. 

Range. One very simple index of dispersion, the range, is obtained
by subtracting the lowest score from the highest score. (Some authors
say add one to the highest score minus the lowest score. Defined this
way, the range includes both the top and bottom score. The simpler
definition given above is the one which is most widely used.) The highest
score on the arithmetic test is 22 and the lowest is 11, giving a
range of 11. The range of scores on the spelling test is 27, indicating
that the dispersion of scores is greater in spelling than arithmetic.
The range is a quickly obtained and often used index of dispersion.
The major fault of the range is that it depends on only two scores, the
highest and the lowest. Consequently, it is rather unstable. If Eric had
not taken the spelling test, the range would have been only 14 instead
of 27. In other words, if by chance Eric had been in another school or
another section at the same school, the teacher would have come to a
very different conclusion about the dispersion of scores on the test.
Another fault of the range is that it does not offer a good starting point
for the development of other statistics. The range is recommended as a
measure of dispersion only when expediency is important and there
are no very extreme scores. When time permits its calculation, a much
better measure of dispersion is the standard deviation, which is discussed
in the following section.

Standard Deviation. The first step in developing the standard
deviation is to square each of the deviation scores. Then these are
summed and the sum divided by the number of students. The resulting
measure is called the variance. The square root of the variance is an
even more useful index of dispersion. It is called the standard deviation
(SD). The variance and SD for the arithmetic and spelling tests are
obtained as follows:


                                Arithmetic            Spelling
                                squared deviations    squared deviations
                                     42.90                 45.29
                                     11.90                  7.45
                                      2.10                 32.82
                                     11.90                 13.91
                                      2.10                   .07
                                      2.10                  7.45
                                      2.40                 22.37
                                     12.60                 52.85
                                     19.80                  1.61
                                       .20                  7.45
                                     20.70                410.87

Sum of squares                      128.70                602.12
Variance (sum of squares/11)         11.70                 54.74
SD (square root of variance)          3.42                  7.40


The variance is the mean squared deviation score; the SD is the square 
root of the variance. The reason why it is more convenient to work 
with the SD rather than the variance will become clearer in a moment 
when the normal distribution is discussed. For the meantime, the 
important point is that the SD is a very useful measure of dispersion. 
When the SD is relatively large, scores scatter widely about the mean; 
conversely, when the SD is relatively small, students do not vary widely 
in performance. 
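
The steps just described (square the deviations, average them, take the square root) can be sketched in Python as follows. The arithmetic scores are those of the chapter's frequency distribution; the function name `variance_and_sd` is hypothetical.

```python
from math import sqrt

arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]

def variance_and_sd(scores):
    """Mean squared deviation (dividing by N, as in the text)
    and its square root."""
    n = len(scores)
    m = sum(scores) / n
    squared = [(s - m) ** 2 for s in scores]
    variance = sum(squared) / n
    return variance, sqrt(variance)

var, sd = variance_and_sd(arithmetic)
print(f"variance = {var:.2f}, SD = {sd:.2f}")
```

Python's `statistics.pvariance` and `statistics.pstdev` divide by the number of scores in the same way. `statistics.variance` and `statistics.stdev` divide by one less than the number of scores and therefore give slightly larger values.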

There are numerous short-cut formulas for obtaining the SD. If an 
automatic calculator is available, the SD can be computed directly 
from raw scores without having to compute deviation scores. Formulas 
for this and for other computational approaches to obtain the SD are 
described in Appendix C-8. All the formulas give the same result. 

Like the mean, the SD is sensitive to very extreme scores. For ex- 
ample, it was seen that the squared deviation score for Eric in spelling 
(the last in the list) was 410.87, which served to greatly enlarge the 
standard deviation. As was mentioned previously, it is fortunate that 
such extreme deviations are unusual; consequently, the SD is a mean- 
ingful measure of dispersion in most practical work. 
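
The sensitivity of both the range and the SD to one extreme score can be illustrated with the spelling scores. The sketch assumes the eleven spelling scores implied by the chapter's frequency distribution; `score_range` is a hypothetical helper.

```python
from statistics import pstdev

spelling = [75, 62, 56, 55, 52, 52, 52, 51, 50, 49, 48]

def score_range(scores):
    """Highest score minus lowest score (the simpler definition)."""
    return max(scores) - min(scores)

without_eric = [s for s in spelling if s != 75]

print("range:", score_range(spelling), "->", score_range(without_eric))
print("SD:   ", round(pstdev(spelling), 2), "->", round(pstdev(without_eric), 2))
```

Removing Eric's score cuts both measures roughly in half, which is why neither should be read uncritically when a very extreme score is present.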


Transformations of the Mean and Standard Deviation 


In developing the norms for a commercially distributed test, it often 
is convenient to transform raw scores to a new set of scores having a
particular mean and standard deviation. One such transformation is to
fix the mean at 50 and the standard deviation at 10. The advantage of 
using a set mean and standard deviation is that it facilitates the in- 
terpretation of particular scores. Such transformations help circumvent 
the problems in comparing scores on tests that differ with respect to 
mean and standard deviation. For example, if the records of a transfer 
student show that in the fourth grade he made a score of 420 on 
achievement test X, how does this compare with the norms on achieve- 
ment test Y used in the new school? This would be easier to answer if 
norms on both tests had been converted to distributions having the 
same mean and standard deviation.

Transformations of score distributions are helpful to teachers in 
comparing results from different classroom examinations, for example, 
the results from four tests of arithmetic administered over a period of 
eight months. The tests might have differed considerably in terms of 
difficulty of problems and number of problems. These differences 
would make for differences in means and standard deviations. Convert- 
ing the scores from the four tests to distributions having the same mean 
and standard deviation would greatly facilitate interpretations of the 
results. Another advantage of such conversions is that they permit us to
work with nice round numbers rather than with awkward numbers.

How conversions of scores are made can be illustrated with a spell- 
ing test which, when administered to a large group of students, has a 
mean of 30 (words correctly spelled) and a standard deviation of 5. 
To convert scores to a distribution with a mean of 50 and a standard 
deviation of 10, the first step would be to multiply all scores by 2. The 
resulting scores will have a mean of 60 and a standard deviation of 
10. Then by subtracting 10 from all scores, a distribution is obtained 
with a mean of 50 and a standard deviation of 10. These calculations 
illustrate two rules in transforming distributions. First, if all scores are 
multiplied by a number, the mean and standard deviation are multiplied
by that number. Second, if a number is either added to or subtracted
from all scores, the mean is either increased or decreased by
the amount of that number, but the standard deviation is unchanged. 
In Appendix C-10 a convenient formula is described for transforming
scores to a new distribution having any desired mean and standard
deviation.
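
The two rules can be verified on a small set of scores. In the sketch below the four raw scores are hypothetical, chosen only so that their mean is 30 and their SD is 5. The function `rescale` is the usual linear rescaling and is offered as an assumption; the form of the formula printed in Appendix C-10 is not reproduced here and may differ.

```python
from statistics import mean, pstdev

def rescale(scores, new_mean=50.0, new_sd=10.0):
    """Linearly transform scores to a chosen mean and SD.

    This is the usual linear rescaling; the exact form given in
    Appendix C-10 is not reproduced here and may differ.
    """
    m, sd = mean(scores), pstdev(scores)
    return [new_mean + new_sd * (s - m) / sd for s in scores]

# Hypothetical raw scores with mean 30 and SD 5.
raw = [25, 25, 35, 35]
doubled = [2 * s for s in raw]        # rule 1: mean and SD both double
shifted = [s - 10 for s in doubled]   # rule 2: mean drops by 10, SD unchanged

print(mean(doubled), pstdev(doubled))
print(mean(shifted), pstdev(shifted))
print(rescale(raw))
```

The two-step route (multiply by 2, then subtract 10) and the one-step `rescale` call give the same transformed scores.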


Score Distributions 


In addition to computing the measures discussed so far in this chap- 
ter, it often facilitates the interpretation of test results to make a 
graphic presentation of the scores. This will provide a useful picture
of the total test results. The first step is to obtain a frequency distribution,
which is simply a count of the number of students who made
each score. Frequency distributions for the arithmetic and spelling 
tests are as follows: 


     Arithmetic              Spelling
Score    Frequency      Score    Frequency
 22         1            75         1
 21         0            ...
 20         1            ...
 19         1            62         1
 18         0            ...
 17         1            ...
 16         0            56         1
 15         1            55         1
 14         3            54         0
 13         0            53         0
 12         2            52         3
 11         1            51         1
                         50         1
                         49         1
                         48         1


The two frequency distributions are presented graphically in Figure 
3-1. The vertical bar above each score point indicates the number of 
students making that score. The bars show that on the arithmetic test 
only two score points have frequencies greater than one: the score 
point of 12 has a frequency of 2, and the score point of 14 has a fre- 
quency of 3. Two pieces of information are obtained immediately by 
looking at frequency distributions—the mode and the range. The mode 
is represented by the tallest bar, 14 on arithmetic and 52 on spelling. 
The lowest and highest scores can be immediately seen (11 and 22,
respectively, on the arithmetic test). The difference between these, 11
for the arithmetic test, is the range.
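
A frequency distribution of this kind is easy to tabulate by machine. The following sketch counts the eleven arithmetic scores and prints a crude bar for each score point; `frequency_distribution` is a hypothetical helper name.

```python
from collections import Counter

arithmetic = [22, 20, 19, 17, 15, 14, 14, 14, 12, 12, 11]

def frequency_distribution(scores):
    """Count how many students made each score, including scores
    inside the range that nobody made (frequency 0)."""
    counts = Counter(scores)
    return {s: counts[s] for s in range(max(scores), min(scores) - 1, -1)}

freq = frequency_distribution(arithmetic)
for score, n in freq.items():
    print(f"{score:>3} {'#' * n} ({n})")

mode = max(freq, key=freq.get)
print("mode:", mode, "range:", max(freq) - min(freq))
```

The mode and the range fall out of the table directly, just as they can be read from the tallest bar and the two end points of the graph.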


Although the eleven scores on each of the two tests are sufficient to
indicate how frequency distributions are formed and interpreted, the
real importance of the frequency distribution occurs when there are
many scores being analyzed. Suppose we had the scores of 200 students
on the arithmetic test. The resulting frequency distribution probably
would look much like that in Figure 3-2. The frequency distribution
shows, for example, that ten students made scores of 12 and twenty
students made scores of 18. The mode is 16, and the range is 13 (10 to
23).


Figure 3-1. Frequency distributions for arithmetic and spelling tests.
(Bar charts; vertical axis: frequency; horizontal axes: arithmetic scores
and spelling scores.)

In addition to providing a handy pictorial representation of scores,
the frequency distribution has other values due to the fact that such
distributions tend to resemble a particular mathematical distribution
with many interesting properties. Although exceptions are to be found,
distributions obtained when many students are tested tend to have
much the same appearance. The highest frequencies usually are found
in the middle of the distribution, in the zone of scores from 15 to 17 in
Figure 3-2. The further one goes from the middle (mean), the less
frequently scores occur. At the extremes, or “tails,” of the distribution,
frequencies are very small. In Figure 3-2 only two students score 10,
and only two score 23. Distributions of test scores tend to be symmetrical;
i.e., score frequencies tend to fall off at the same rate to the right
and left of the mean. Moreover, the rate at which frequencies lessen
when going in either direction from the mean is fairly predictable from
what is called the normal distribution. Because of the extreme importance
of the normal distribution in analyzing and interpreting test
scores, further explanations and examples will be given. (Teachers
should understand that the characteristics which have been described,
symmetry and decreasing frequencies in general resemblance to the
normal distribution, usually occur only when there are many scores
being analyzed. When only a few scores are being analyzed, distributions
can, and usually do, depart widely from these ideal properties.)


Figure 3-2. Hypothetical frequency distribution of arithmetic scores for 200
students. (Bar chart; vertical axis: frequency; horizontal axis: scores.)


Normal Distribution. The normal distribution was discovered in
connection with games of chance, where, for example, one problem is
to estimate the frequencies with which certain coin tosses occur.
Suppose, for example, that ten coins are tossed on a table. What are
the odds that eight of them will be heads? If ten coins were tossed
one thousand times, in how many tosses would all ten be heads? What
would occur more frequently, eight heads and two tails, or seven heads
and three tails? The normal distribution was derived in connection with
questions such as these. It is a mathematical formula which estimates
the frequency with which chance events occur.

The normal distribution formula relates to the situation in which 10
coins are tossed many times, to choose a statistically convenient number,
say 1,024 times. The expected distribution of results is shown in
Table 3-1. The distribution of results shows, as intuition would suggest,
that the most frequently occurring result would be 5 heads and 5 tails,
occurring 252 times. On the extremes, 10 heads would be expected
only once. The frequency distribution is shown graphically in Figure
3-3. Note the general resemblance to the distribution of arithmetic
scores shown in Figure 3-2.

Table 3-1: Expected Occurrences of “Heads” and “Tails” for 10 Coins
Tossed 1,024 Times

Frequency    1   10   45  120  210  252  210  120   45   10    1
Heads        0    1    2    3    4    5    6    7    8    9   10
Tails       10    9    8    7    6    5    4    3    2    1    0
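
The expected frequencies for ten coins tossed 1,024 times can be reproduced by counting the number of ways each result can occur. The sketch assumes fair coins that fall independently; `expected_heads` is a hypothetical function name.

```python
from math import comb

def expected_heads(n_coins=10, n_tosses=1024):
    """Expected number of tosses showing k heads, for k = 0..n_coins,
    assuming fair, independent coins."""
    total = 2 ** n_coins
    return [comb(n_coins, k) * n_tosses / total for k in range(n_coins + 1)]

for heads, freq in enumerate(expected_heads()):
    print(f"{heads:>2} heads, {10 - heads:>2} tails: {freq:g}")
```

Strictly speaking, these counts come from what statisticians call the binomial distribution; the normal curve is the smooth shape that such counts approach as the number of coins grows.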


Figure 3-3. Graph of expected occurrences of “heads” and “tails” for 10 coins
tossed 1,024 times. (Bar chart; vertical axis: frequency; horizontal axis:
number of heads and tails.)

Suppose that each toss had employed 100 coins instead of 10. Then
the graph would contain 101 bars, covering the range from zero heads
to 100 heads. Because of the larger number of bars, the graph would
look less jagged; i.e., the “steps” on the graph would be much narrower.
If the number of coins were increased to 1,000, the steps would be so
small as to be hardly visible, and the frequency distribution would
begin to look like a smooth curve rather than a set of steps.
Challenging the imagination further, what would the frequency
distribution look like if there were an infinite number of coins
tossed an infinite number of times? The normal distribution
tells us what to expect, which is shown by the smooth curve in
Figure 3-4.

Figure 3-4. Smoothed curve showing expectancies of “heads” for a large
number of coins tossed many times. (Vertical axis: frequency; horizontal
axis: number of heads.)

If the normal distribution applied only to games of chance, it would
be of little interest to us here. The reason that the normal distribution
is so important is that test scores often distribute themselves much like
the normal distribution. In that case each test item is the counterpart
of one coin, and each toss is the counterpart of one student. For
example, if a well-constructed spelling test containing ten words were
administered to 1,024 students, the results might approximate the 
frequency distribution shown in Figure 3-3. Zero heads would corre- 
spond to getting none of the spelling words correct, and ten heads 
would correspond to getting all the words correct. For reasons too 
technical to discuss here, the distribution of test scores obtained in 
practice probably would be slightly different from that shown in 
Figure 3-3; but if the test is well constructed (in accordance with 
principles which will be discussed in later chapters), the difference 
between the shape of the distribution of test scores and that obtained 
from coin-tossing experiments would be relatively slight. 

Instead of only a ten-item test, imagine that the test contains an 
infinite number of items and is administered to an infinite number of 
students. Then the results would tend to approximate the bell-shaped 
curve (normal distribution) shown in Figure 3-4. Of course, infinity is 
only a useful fiction, but the normal distribution is often well approxi- 
mated when the test contains thirty items or more and is administered 
to one hundred students or more. Because of the resemblance between 
many distributions of test scores and the hypothetical normal distribu- 
tion, it is possible to borrow some of the very useful mathematical 
results that follow from the normal distribution. 

Many different human attributes are distributed approximately in 
accordance with the normal distribution—heights of all the children in 
the fourth grade in a particular city, numbers of hairs on children’s
heads, and scores on intelligence tests. Indeed, it is the exception to
find human attributes which are not distributed, at least roughly, in
accordance with the normal distribution.

There is nothing magical about the normal distribution and no neces- 
sary reason why test scores should be so distributed. But because the 
normal distribution is an intuitively appealing way to think about test
scores, and because having an approximately normal distribution opens
the door to many useful mathematical procedures, we often construct
tests in such a way as to ensure an approximately normal distribution.

In practice it is not necessary to be compulsively concerned about
the normal distribution. All that is necessary in order to use the procedures
which depend on the normal distribution is to look at the frequency
distribution and see if the resemblance is reasonably close. For
all practical purposes, a distribution of test scores can be considered 
a reasonable approximation to the normal distribution if the bulk of 
the scores cluster about the mean, if the distribution is not markedly 
lopsided, and if scores trail off about the mean in general resemblance 
to the curve shown in Figure 3-4. 

In spite of the usefulness of the normal distribution in analyzing the
results of tests, it should be clearly understood that having a normal
distribution is not necessarily an indication that the test is valid in any
of the three senses in which test validity was discussed in Chapter 2. 
Remember that coin tossing would provide a good approximation of 
the normal distribution, and, although students sometimes accuse us of 
using it, tossing coins would not provide a valid test. 

Standard Scores. One important property of a score distribution is 
its standard deviation. As was shown previously, the standard devia- 
tions for the arithmetic and spelling distributions are 3.42 and 7.40, 
respectively. Using the standard formula, a standard deviation could 
be obtained for the 200 arithmetic scores shown in Figure 3-2 and even 
for the distribution of “heads” shown in Figure 3-3. When the distribu- 
tion of scores is approximately normal, the standard deviation has a 
very useful property: it indicates the percentages of students who lie 
in various regions of the distribution. 

Previously it was said that it is difficult to interpret raw scores and 
that one of the ways to make scores more interpretable is to convert to 
deviation scores (by subtracting the mean raw score from each raw
score). It is difficult to interpret deviation scores, however, until the 
dispersion of scores is considered. One way to do this is to compare 
each deviation score with the standard deviation, in other words, to use 
the standard deviation as a unit of measurement. Scores analyzed in
this way are called standard scores and are obtained by dividing each
deviation score by the standard deviation of the particular distribution
as follows:

$$\text{Standard score} = \frac{\text{deviation score}}{\text{standard deviation}}$$

Applying the formula to Johnny’s arithmetic score we find:

$$\text{Standard score} = \frac{22 - 15.45}{3.42} = \frac{6.55}{3.42} = 1.9$$

On the spelling test, Johnny’s standard score is:

$$\text{Standard score} = \frac{48 - 54.73}{7.40} = \frac{-6.73}{7.40} = -.9$$

Johnny is almost two standard deviations above the mean on arithmetic 
and almost one standard deviation below the mean on spelling. 
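
The same computation can be written as a one-line function. In this sketch the spelling mean (54.73) and both SDs are the values quoted in the chapter; the arithmetic mean of 15.45 is an assumption that follows from the eleven arithmetic scores in the frequency distribution. The name `standard_score` is hypothetical.

```python
def standard_score(raw, mean, sd):
    """Deviation score expressed in standard-deviation units."""
    return (raw - mean) / sd

# Arithmetic: raw 22, mean 15.45 (assumed, see above), SD 3.42.
johnny_arithmetic = standard_score(22, 15.45, 3.42)
# Spelling: raw 48, mean 54.73, SD 7.40.
johnny_spelling = standard_score(48, 54.73, 7.40)

print(round(johnny_arithmetic, 2), round(johnny_spelling, 2))
```

Because both results are expressed in standard-deviation units, they can be compared directly even though the two tests differ in mean and spread.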




The reason why the normal distribution is important is that, if the 
distribution of test scores resembles the normal distribution, standard 
scores on the test can be easily interpreted. In that case, it is very easy 
to determine the approximate number of students who score above or 
below any specified number of standard deviations above or below the 
mean. A complete table showing the per cents of students lying in 
various regions of the normal distribution is given in Appendix B. A 
less detailed breakdown is shown in Figure 3-5. The figure shows, for 
example, that only about 2 per cent of the students score above two
standard deviations, i.e., have standard scores of 2.0 or higher. Because
Johnny has a standard score in arithmetic of 1.9, this means that most
of the students score lower. (The exact interpretation of per cents of
students lower would require many more than eleven students being
tested. In the small number of students in this example, Johnny actually
has the highest score in arithmetic.)


Figure 3-5. Per cents of subjects in various regions of the normal distribution.
(The per cents add up to 99.6 instead of 100 because a fraction of 1 per cent of
the cases lies above and below three standard deviations.) (Vertical axis:
frequency; horizontal axis: standard deviation, from -3.00 to 3.00.)


As Figure 3-5 shows, practically none have standard scores below
-3.0. Going upward, about 2 per cent score less than -2.0 standard
scores, 16 per cent less than -1.0, 50 per cent less than 0 (below the
mean), 84 per cent less than 1.0, and 98 per cent less than 2.0. When
there are at least thirty items and one hundred students, such interpre- 
tations of standard scores are fairly accurate. Even when there are 
fewer items and students, say twenty items and thirty students, the 
normal distribution provides a useful approximation to the per cents of 
students scoring above and below selected points on the score 
continuum. 
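
The per cents quoted above can be recovered from the normal distribution itself rather than from a printed table. The sketch uses `NormalDist` from Python's standard `statistics` module; `percent_below` is a hypothetical helper name, and the results apply only to the extent that the scores are approximately normal.

```python
from statistics import NormalDist

def percent_below(z):
    """Per cent of a normal distribution lying below standard score z."""
    return 100 * NormalDist().cdf(z)

for z in (-3, -2, -1, 0, 1, 2, 3):
    print(f"z = {z:+d}: {percent_below(z):5.1f} per cent below")
```

Rounded to whole per cents, the values at -2, -1, 0, 1, and 2 are 2, 16, 50, 84, and 98.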

Standard scores (or transformations of them) are widely used in
commercially distributed tests such as intelligence tests. Because these
tests usually have many items and are administered to thousands of
students, and because distributions closely approximate the normal
distribution, standard scores can be interpreted rather exactly.

As was mentioned previously, it frequently aids the interpretation 
of test results to convert the distribution to one having a desired mean 
and standard deviation. Essentially this is what is done in converting
scores from raw (or deviation) scores to standard scores. Expressed
as standard scores, all distributions have the same mean, which is zero;
all distributions have the same standard deviation, which is 1.0. The
value of using this transformation rather than some other transforma- 
tion is that it ties in directly with the normal distribution and provides 
an easy guide to the numbers of students scoring above or below 
selected points on the test continuum. 

Transformed Standard Scores. Although standard scores are di- 
rectly useful to anyone who is familiar with educational measurement, 
people who are naive in this respect have some difficulty in interpret- 
ing standard scores. For example, a standard score of zero is often 
misinterpreted as meaning zero instead of average performance on 
the test. Some people find it difficult to understand negative standard 
scores, those below the mean. For these reasons, standard scores often
are transformed to a distribution having a desired mean and standard
deviation. One such distribution (mentioned earlier) is obtained
when standard scores are transformed to a new distribution having
a mean of 50 and a standard deviation of 10. A handy formula for
making such transformations is presented in Appendix C-10.
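
Starting from standard scores, the transformation amounts to one multiplication and one addition. The sketch below is an assumption about the general form of such a conversion, not a copy of the formula in Appendix C-10; `transformed_score` is a hypothetical name, and Johnny's standard scores are taken as 1.9 and -.9.

```python
def transformed_score(z, new_mean=50, new_sd=10):
    """Re-express a standard score on a scale with the given mean and SD."""
    return new_mean + new_sd * z

print(transformed_score(1.9))    # Johnny, arithmetic
print(transformed_score(-0.9))   # Johnny, spelling
print(transformed_score(0.0))    # an exactly average student
```

On the new scale an average student receives 50 rather than zero, and no one who scores within five standard deviations of the mean receives a negative number.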


Scores Based on Ranks

There are two principal methods of converting raw scores to more
interpretable units. One has been discussed extensively in preceding
pages. In summary, it consists in transforming raw-score distributions
to distributions with prescribed means and standard deviations, e.g.,
standard scores. The second method is based on ranks. As will be
shown, the methods are complementary, and, ultimately, they provide
much the same information.

One of the simplest and in many ways most sensible methods to
transform raw scores to more meaningful units is to rank raw scores
from highest to lowest. In this connection imagine that we are studying
the scores made by twenty students on a thirty-item test. The
student who gets the most answers correct is given a rank of 1, the
student who gets the next most answers correct is given a rank of 2,
and so on to the student who gets the fewest answers correct, who is
given a rank of 20. This simple scheme is complicated somewhat when
there are tied scores, as there almost always are. When that occurs,
students are given the average rank. For example, if three students
make identical raw scores and are tied for second, each student
receives the average of ranks 2, 3, and 4, which is 3. If two students 
are tied for rank 4, they would each receive the average of ranks 4 
and 5, which is 4.5. Following is a list of ranks for an arithmetic 
test: 


Student Raw score Rank 


Kim 29 1 
Fred 27 2.5 
Mary 27 2.5 
Bill 26 4 
Sarah 24 5 
Lewis 21 7 
Martin 21 7 
Jean 21 7 
Patricia 20 9 
Walter 18 10 
Susan 17 12 
Russell 17 12 
Scott 17 12 
Keith 16 14 
Janet 15 15.5 
Betty 15 15.5 
Ronald 14 17 
Lee 12 18 
Caroline 10 19 
Cecil 8 20 


When dealing with the scores for a test administered to only one class, 
ranks are in many ways the most desirable units of scoring. They are 
easy to obtain, and they are easily communicated to others. 
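
The averaging of tied ranks can be sketched in a few lines. The raw scores are those in the list above; `average_ranks` is a hypothetical helper name.

```python
def average_ranks(scores):
    """Rank scores from highest (rank 1) to lowest, giving tied
    scores the average of the ranks they occupy."""
    ordered = sorted(scores, reverse=True)
    ranks = {}
    for score in set(ordered):
        first = ordered.index(score) + 1          # best rank held by the tie group
        last = first + ordered.count(score) - 1   # worst rank held by the tie group
        ranks[score] = (first + last) / 2
    return [ranks[s] for s in scores]

raw = [29, 27, 27, 26, 24, 21, 21, 21, 20, 18,
       17, 17, 17, 16, 15, 15, 14, 12, 10, 8]
print(average_ranks(raw))
```

The three students tied at 21 occupy ranks 6, 7, and 8 and so each receives 7; the two tied at 15 occupy ranks 15 and 16 and each receives 15.5.
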
Percentiles. When many students are being studied, and when 
comparisons are being made among students in different localities, 
it is useful to make a transformation of ranks to what are called 
percentile ranks. A percentile rank is simply the per cent of students 
who fall below a particular score. Thus if 95 per cent of the students 
score lower than Fred, Fred is said to be at the 95th percentile. If 
only 20 per cent of the students score lower than Mary, she is at the 
20th percentile.
Percentiles are very much like the ranks that would be obtained in 
a group of exactly 100 students, except that in using ranks we cus- 
tomarily give the highest score a rank of 1. If instead we gave the 
lowest score a rank of 1 and the highest score a rank of 100, these 
would be almost identical with percentiles. The slight difference is 
that percentiles are defined as the per cent of students who score
lower than a particular score. Thus, the person with rank 100 would
receive a percentile score of 99, because 99 per cent of the students 
are lower. Also, the student with rank 1 would receive a percentile 
score of 0 because none of the students score lower. However, the 
difference is so slight that it is useful to think of percentiles as repre- 
senting ranks when (a) exactly 100 scores are being studied and (b) 
when the largest rank (100) is given to the student with the highest 
test score and the smallest rank (1) is given to the student with the 
lowest test score. 

If there are no tied scores, percentiles are obtained by finding the 
per cents of students below each raw score. For example, if in study- 
ing 200 test scores, Fred makes a score of 76, no other student makes 
a score of exactly 76, and 160 students make scores of less than 76, 
then Fred is at the 80th percentile (160 divided by 200, and the 
result multiplied by 100). 

Because there nearly always will be tied scores, a slight modifica- 
tion of the method described above must be used to obtain percentiles. 
Why such a modification of procedures is necessary is illustrated by 
the situation in which 35 per cent of the students score higher than 
Fred, 50 per cent score lower than Fred, and 15 per cent (including 
Fred) make exactly the same score. In that case it would be mislead- 
ing to say that Fred is at the 50th percentile, because, in fact, only 
35 per cent of the students score higher. This ambiguity can be 
remedied by considering half of the students who make the same 
score as Fred as scoring higher and half of the students as scoring 
lower. Then the first step in obtaining percentiles is to find the 
number of students who score below a particular raw score plus half 
of the number of students who make the particular raw score. The 
total is divided by the number of students in the study, and the results 
are multiplied by 100. For example, if twenty students score higher
than 44, seventy students score lower than 44, and ten students score 
exactly 44, then 44 corresponds to a percentile of 75. By this method 
percentiles can be calculated for all scores, regardless of ties in 
raw scores. 
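
The rule for tied scores can be sketched as follows. The list of scores is hypothetical and is built only to match the example just given (seventy students below 44, ten at 44, twenty above); `percentile_rank` is a hypothetical name.

```python
def percentile_rank(score, all_scores):
    """Per cent of students below `score`, counting half of those tied at it."""
    below = sum(1 for s in all_scores if s < score)
    tied = sum(1 for s in all_scores if s == score)
    return 100 * (below + tied / 2) / len(all_scores)

# Hypothetical scores: 70 students below 44, 10 at 44, 20 above 44.
scores = [30] * 70 + [44] * 10 + [60] * 20
print(percentile_rank(44, scores))
```

Seventy students plus half of the ten tied students makes 75 out of 100, so a score of 44 falls at the 75th percentile.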

Percentiles and standard scores supply much the same information. 
It will be remembered from previous sections that when certain 
conditions hold, standard scores indicate the per cents of students who 
fall in various score regions. For example, approximately 98 per cent 
of the students will have standard scores less than 2.0; approximately 
84 per cent of the students will have standard scores less than 1.0, 
and so on for the other regions of the normal distribution. One of the 
major pieces of information obtained from standard scores is the per 
cents of students making scores above and below particular points on
the test continuum. Because of the ease with which standard scores
can be converted to per cents of students in various score regions, one 
can obtain percentiles by converting all raw scores to standard scores 
and then transforming these to percentiles using the table in Appendix 
B. This will offer an approximation to the percentiles obtained by the 
more direct method described previously. The approximation usually 
will be good if (a) the distribution of raw scores is approximately 
normal, (b) there are many test items (at least 30), and (c) at least 
100 students are being studied. 

One must be careful not to confuse percentile scores with per- 
centage-correct scores. Regarding the latter, it sometimes is useful to 
think in terms of the percentage of items that students get correct. 
However, as should be obvious, percentage-correct scores do not 
directly tell anything about students’ standings with respect to one 
another. If an easy test is being used, a student can get 75 per cent of
the items correct yet be in the bottom quarter of his class and thus
have a percentile score of less than 25. If a hard test is being used, a
student can have a percentage-correct score of only 50 yet be at the
90th percentile in his class.

An important property of percentiles is that they are separated by 
more items on either end of the distribution than they are in the 
middle. This is because there are many more raw scores, and ties 
in raw scores, in the middle of the distribution than at the extremes. 
Consequently, it may be that the student at the 95th percentile gets
five more items correct than the student at the 90th percentile, but
the student at the 55th percentile may get only one more item correct
than the student at the 50th percentile. Similarly, more items separate
adjacent percentiles in the lower end of the scale than in the middle
of the scale. One consequence of this property is that it usually is not
wise to average percentiles obtained on two tests. Rather it is better
to average raw scores or standard scores and then convert the averages
to percentiles.
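A brief sketch shows how much the two procedures can differ. It assumes normally distributed scores and a hypothetical student who is two standard deviations above the mean on one test and exactly at the mean on the other.

```python
from statistics import NormalDist

unit_normal = NormalDist()   # normal curve with mean 0 and SD 1

def to_percentile(z):
    """Per cent of a normal distribution falling below standard score z."""
    return 100 * unit_normal.cdf(z)

z_first, z_second = 2.0, 0.0   # hypothetical standard scores on two tests

# Not recommended: convert each score to a percentile, then average.
averaged_percentiles = (to_percentile(z_first) + to_percentile(z_second)) / 2

# Recommended: average the standard scores, then convert the average.
percentile_of_average = to_percentile((z_first + z_second) / 2)

print(round(averaged_percentiles, 1), round(percentile_of_average, 1))
```

Under these assumptions the first procedure places the student near the 74th percentile and the second near the 84th, a gap that arises only because percentiles are spaced unevenly along the score scale.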


Norms


Two types of scores have been discussed so far: those based on
ranking, such as percentiles, and those based on the normal distribution,
such as standard scores. Both of these were illustrated with
relatively small groups of students in classroom settings. Comparisons
among students in individual classes are very important. They help
determine grades, sectioning, and plans of study and help settle other
problems of day-to-day instruction. In addition, it is often necessary
to compare scores with standards that are somewhat external to the
performance of the particular students. One such standard is what
the teacher expects of his pupils. Regardless of who is above and
below the mean, and by how many standard deviations, the teacher 
may decide that all the students are doing poorly or doing well. The 
teacher's standards will be discussed in Chapter 7. 

Another important type of standard is the performance of larger 
groups of students, possibly including students in other classes at 
the same school and students from other schools and other cities.
Scores obtained from such large groups of students are called 
norms. 

The teacher can see that Johnny made a high score in spelling 
relative to the other students in the class. It is easy to jump to the 
conclusion that Johnny is a good speller in the absolute sense, that he 
could hold his own with students anywhere. This can be determined 
only by comparing Johnny’s spelling ability with that shown by much 
wider groups of students. 

How does the teacher evaluate how well his class as a whole is 
doing? Are they learning to spell well, or are they doing poorly? 
Although the teacher's past experience and expectations are important, 
normative comparisons also are very helpful. By comparing the aver- 
age performance in the class with the average performance in other 
schools and other cities, the teacher would have a better idea of how 
well his class was progressing. 

The first step in obtaining norms for a test is to define a normative 
population. What population is defined depends on the interpretations
that need to be made of scores. This may be all the children in the 
United States (the normative population for most intelligence tests), 
all the children in a particular school system (which would offer one 
basis for interpreting achievement tests), or all the children in the 
fourth grade at Woodlawn Elementary School (which would be useful 
in making decisions about pupils in the local setting).

The construction and use of norms is not nearly as simple as it 
appears at first glance, and there are some definite pitfalls to be 
avoided. The use of norms will be discussed at a number of points 
in the following chapters. In this chapter we will discuss some of the 
types of scores with which norms usually are expressed.

Percentile Norms. Norms can be expressed by percentiles in much 
the same way that the teacher converts the scores of his students to 
percentiles. When using large-scale norms, however, the teacher com- 
pares the performance of particular students with the performance of 
the normative group. In developing percentile norms for commercially 
distributed tests, the test author gathers responses from many children 
(usually more than one thousand and sometimes as many as forty 
thousand). He then translates raw scores into percentile equivalents.
The percentile equivalents are published in the manual of instructions
for the test. An example is as follows: 


Raw score     Percentile
140-142           99
138-139           98
137               97
136               96
. . .            . . .
14                 1
12-13


If Johnny makes a score of 140, 141, or 142, he is at the 99th percentile, 
having scored higher than 99 per cent of those in the normative group. 
If he scores either 138 or 139, he is at the 98th percentile. Scores of 
137 and 136 are, respectively, at the 97th and 96th percentiles. Percentile
equivalents are not shown for percentiles of 95 through 2.
A score of 14 would represent the 1st percentile, with only 1 per cent
of the group standing lower.
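Using a table of percentile norms amounts to a simple lookup, which might be sketched as follows. Only the rows spelled out in the example are included; the name `PERCENTILE_NORMS` and the function are illustrative and do not come from any test manual.

```python
# (lowest raw score, highest raw score) -> percentile, from the example rows.
PERCENTILE_NORMS = [
    ((140, 142), 99),
    ((138, 139), 98),
    ((137, 137), 97),
    ((136, 136), 96),
    ((14, 14), 1),
]

def percentile_for(raw_score):
    """Return the percentile equivalent, or None if the row is not shown."""
    for (low, high), percentile in PERCENTILE_NORMS:
        if low <= raw_score <= high:
            return percentile
    return None   # rows for the 95th through 2nd percentiles are omitted

print(percentile_for(141))   # Johnny's score of 141
print(percentile_for(14))
```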

Standard Score Norms. Norms are often expressed as either standard
scores or transformed standard scores. The performance of
particular pupils is then compared with the mean score in the
normative group and with the standard deviation found in the normative
group. Rather than require the user to subtract the
mean from Johnny’s score and divide the result by the standard deviation,
nearly all commercially distributed tests have tables available to
translate raw scores directly into standard scores and/or transformed
standard scores. Following is an example employing both standard
scores and norms transformed to a distribution having a mean of 50
and a standard deviation of 10:


Raw score     Standard score     Transformed standard score
148
147
. . .
85                                          51
84 (mean)           0                       50
83                                          49
. . .
22
21                                          18
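The arithmetic that such a table saves the teacher can be sketched as follows. The normative mean of 84 is taken from the example; the normative standard deviation of 12 is purely hypothetical, chosen only to make the illustration run.

```python
def standard_score(raw, norm_mean, norm_sd):
    """Number of standard deviations a raw score lies from the norm mean."""
    return (raw - norm_mean) / norm_sd

def transformed_score(z, new_mean=50, new_sd=10):
    """Re-express a standard score on a scale with the chosen mean and SD."""
    return new_mean + new_sd * z

NORM_MEAN = 84   # mean of the normative group in the example
NORM_SD = 12     # hypothetical; not given in the example

for raw in (96, 84, 72):
    z = standard_score(raw, NORM_MEAN, NORM_SD)
    print(raw, z, transformed_score(z))
```

A raw score at the normative mean always receives a standard score of 0 and a transformed score of 50, whatever the standard deviation happens to be.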


Age Norms. It sometimes is desirable to express norms in terms
of children’s ages. One such set of age norms could be obtained by
testing the vocabulary of children at all ages from four to twelve.
For this purpose, a list of 100 words varying in difficulty could be 
used. The mean score could be obtained for each age group separately. 
A graphic plot of these would resemble that shown in Figure 3-6. The 
figure shows, for example, that the average seven-year-old child has a 
larger vocabulary than the average six-year-old, as would be expected. 

Age norms would prove useful in interpreting the vocabulary
scores of particular pupils. Suppose that a child is seven years
old and makes a vocabulary score of 25 (correctly identifies or supplies
the meaning of twenty-five words in the list). The mean
score for seven-year-olds in the normative group is 30, which indicates
that the particular pupil is somewhat below average for
his age level. To determine how much below average, we can find
the age group which corresponds to a score of 25. Although there
is no age group whose mean score is exactly 25, we can work
as though the curve were continuous throughout all age levels and
determine the fractional age group corresponding to that score.
Reading from Figure 3-6, it is seen that an age group of approximately
6.5 corresponds to a score of 25. It can be said that the seven-year-old
child has a vocabulary approximately that of the average
6½-year-old. Age norms fit in well with the way in which we
customarily think about the progress of children, and they are,
therefore, very useful in the interpretation of test scores.

Figure 3-6. Mean vocabulary scores of children at each age from four to
twelve years. (Horizontal axis: age, 4 to 12; vertical axis: mean score, 0 to 90.)

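Reading a fractional age from the curve is the same as interpolating between the means of adjacent age groups, which might be sketched as follows. Only the mean of 30 for seven-year-olds comes from the example; the other means are hypothetical values chosen so that a score of 25 falls halfway between the six- and seven-year-old groups, and the sketch assumes that means rise with age.

```python
def age_equivalent(score, age_means):
    """Interpolate the (fractional) age whose mean score equals `score`."""
    ages = sorted(age_means)
    for younger, older in zip(ages, ages[1:]):
        low, high = age_means[younger], age_means[older]
        if low <= score <= high:
            if high == low:
                return float(younger)
            fraction = (score - low) / (high - low)
            return younger + (older - younger) * fraction
    return None   # score lies outside the range covered by the norms

# Mean vocabulary score by age; every value except 7 -> 30 is hypothetical.
age_means = {4: 5, 5: 12, 6: 20, 7: 30, 8: 42}

print(age_equivalent(25, age_means))
```

With these assumed means, a score of 25 corresponds to an age of 6.5, matching the reading taken from Figure 3-6.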
Educational Age. Age norms are often employed with commer- 
cially distributed achievement tests which cover most of the subjects 
taught in school. After the test is administered to a broad sample 
(normative group), average scores can be found for each age level. 
Then in all future uses of the test, scores of particular pupils can be 
compared with the age norms. If a child is nine years old and achieves 
an age score of only seven, this means that he is considerably behind 
his age group. Such an age score is often referred to as the educational 
age (EA) of the pupil. If some cautions are heeded (ones which will 
be discussed later in this chapter), the EA offers one way of 
interpreting the progress of students.


Mental Age. Age norms are employed with some of the commer- 
cially distributed intelligence tests. By comparing a child’s score with 
age norms, it can be determined whether he is more or less intelligent
than the average child of his age. A score which is compared
in this way with age norms is referred to as a mental age. If a six-
year-old performs as well as the average seven-year-old child, he
is said to have a mental age of seven. Although there are some
definite dangers in interpreting mental ages (which will be discussed
later), they have been used quite widely, and if not overinterpreted,
they offer a useful way of understanding the mental ability of
pupils.

Grade Norms. Norms which are very similar to age norms can 
be obtained by finding the mean scores for children at various grade 
levels. If a standard spelling test were given to children in the fourth 
through eighth grades, the means could be plotted by grades, much 
as they were plotted by ages in Figure 3-6. Comparing scores with 
grade norms gives the teacher an indication of how well pupils are 
progressing. 


Age and grade norms are usually obtained at the end of the school
year, for example, by testing children who are finishing the fourth,
fifth, etc., grades. Age norms usually are very similar to grade norms,
and it is difficult to argue that one type of norm is generally better
than the other. The advantage of grade norms is that comparisons are
made among children who have had the same amount, if not precisely
the same kind, of education. The relative disadvantage of grade norms
is that they tend to penalize accelerated pupils and overestimate the
progress of retarded pupils. If a pupil has just been double promoted,
he may appear only average in comparison with his classmates, who
are a year older; but he would still appear superior with respect to
children of his own age (using age norms rather than grade norms).
Also, because children sometimes differ by as much as eight months
or more of age within grades, there is some inequity in the use of
grade norms. When possible, it is helpful to convert students’ scores
to both age norms and grade norms to utilize the slightly different
kinds of information which they supply.

Quotient Scores as Norms. With a number of the norms discussed
previously, there are two indices involved: educational age and chronological
age, mental age and chronological age, grade equivalent and
actual grade, and others. It is tempting to divide one component into
the other and obtain a quotient of performance. This has proved so
popular that a number of such quotient scores have been widely used.
Although in a later section it will be shown that such quotient scores
have very serious faults, and better indices are available, the quotient
scores have been too widely used to be ignored here.


The most popular quotient score is the intelligence quotient (IQ),
which is obtained as follows:

$$\mathrm{IQ} = \frac{\mathrm{MA}}{\mathrm{CA}} \times 100$$

In this formula, MA stands for mental age and CA stands for chronological
age. The formula shows, for example, that if an eight-year-old
child does as well on an intelligence test as the average ten-year-old,
he has an IQ of 125. An IQ of 100 is precisely average; above 100
means above average, and below 100 means below average.

Similar to the IQ, an educational quotient (EQ) can be obtained 
by dividing educational age by chronological age and multiplying the 
result by 100. Whereas the IQ is intended to represent the relative 
intelligence of the child, the EQ is intended to represent the relative 
progress of the child in school. 

An even more complex quotient score can be obtained by dividing 
EA by MA (or EQ by IQ), which is called the accomplishment 
quotient (AQ). This supposedly represents the extent to which the 
student is working up to capacity. However, for reasons which will be 
discussed in a later section, the AQ is so riddled with conceptual 
pitfalls and statistical artifacts as to be nearly worthless.
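The three quotients share one piece of arithmetic, sketched below. The IQ example (mental age ten, chronological age eight) is the one given above; the ages used for the EQ and AQ are hypothetical, and multiplying the AQ by 100 is an assumption made by analogy with the other two quotients.

```python
def quotient(numerator_age, denominator_age):
    """One age divided by another, multiplied by 100."""
    return 100 * numerator_age / denominator_age

# IQ = MA / CA x 100: the eight-year-old who performs like a ten-year-old.
iq = quotient(10, 8)

# EQ = EA / CA x 100: a hypothetical nine-year-old with an educational age of 7.
eq = quotient(7, 9)

# AQ = EA / MA x 100 (scaling by 100 is assumed): educational age 9, mental age 10.
aq = quotient(9, 10)

print(iq, round(eq, 1), aq)
```

The sketch shows only how the quotients are computed; it does not answer the criticisms of quotient scores raised in the text.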


Cautions in the Use of Norms 


Norms, expressed in the forms discussed in the previous section, 
are completely essential to anyone who works with tests. Unless the
weak points and potential pitfalls in the use of norms are understood, 
however, some very poor interpretations can be made of test results. 
In order not to interrupt the explanation of how various norms are 
obtained, extensive criticisms were not made of the various types of 
normative scores. Criticisms and cautions will be given in this section.

Sampling. A potential fault of any set of norms relates to the 
way in which the normative group is obtained. The normative group
is supposed to be representative of some defined population. As was 
said previously, the population may be all the children in the United 
States, all the children in a particular school district, or all the 
children in a particular school, depending on the ways in which the 
norms are to be used. To obtain representativeness, it must be ensured 
that the normative group is unbiased and that sufficient numbers of 
students are tested. 

A normative group (in statistical language, a sample) is unbiased 
if every child in the designated population has an equal chance of 
being selected for testing. One of the surest ways to guarantee that
a sample is unbiased is to select children randomly, i.e., draw names
out of a hat. Random samples can be drawn when norms are constructed
locally, but for regional and national norms, approximate
procedures must be employed. If carefully done, these approximate
sampling procedures can lead to representative results. Criticisms by
experts of sampling procedures used in the standardization of tests 
are usually found in the research literature. (Some of the most prom- 
inent sources for such criticisms are listed in Chapter 17.) 

In addition to being unbiased, a sample must contain sufficient 
numbers of children. If, for example, there were only twenty children 
in the sample (normative group), luck might have it that as a group 
they would be much above average or much below average. Norms 
obtained from so small a group would provide a very poor basis for 
the interpretation of scores. In comparison to such norms, a child 
who is really average might appear either superior or much below 
average. The only definite rule that can be given in choosing the 
number of students to test is “the more the merrier.” To obtain 
national norms, it is wise to include at least several thousand students 
in the sample. If norms are to be obtained from a particular school 
district and there are twenty thousand students in the district, it would 
be wise to test at least one thousand students (and, as was said previ- 
ously, these should be selected by a random, or approximately random, 
procedure). 
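The effect of sample size can be illustrated with a small simulation. Everything in it is hypothetical: a district of twenty thousand students whose scores are generated from a normal curve with mean 50 and standard deviation 10, and random samples drawn by the computer in place of names out of a hat.

```python
import random
from statistics import mean

rng = random.Random(0)   # fixed seed so that the illustration is repeatable

# Hypothetical district: 20,000 students with normally distributed scores.
population = [rng.gauss(50, 10) for _ in range(20_000)]

def sample_mean(n):
    """Mean score of n students drawn at random, without replacement."""
    return mean(rng.sample(population, n))

# Draw many normative groups of each size and see how their means vary.
small_means = [sample_mean(20) for _ in range(200)]
large_means = [sample_mean(1000) for _ in range(200)]

spread_small = max(small_means) - min(small_means)
spread_large = max(large_means) - min(large_means)
print(round(spread_small, 1), round(spread_large, 1))
```

The means of the twenty-student groups wander far more than the means of the thousand-student groups, which is why norms based on very small samples can make an average child look superior or deficient.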

If in a test manual it is indicated that the sample was biased (e.g.,
only children in Chicago were tested), or only a relatively small 
sample was employed, the reported norms should be held suspect. 
Because of these potential errors of sampling, it is often found that a 
child will make rather different scores on two tests, purely because 
of the nonrepresentativeness of one or both of the samples used to 
obtain norms. For this, and for numerous other reasons to be discussed 
later, it is always much safer to average the normative scores obtained 
from two different tests than to depend on the results from one test 
alone. This rule cannot be overemphasized. 

Unreliability. Regardless of how carefully norms are obtained, a 
child’s score should not be taken as a final and exact indication of 
performance. Any particular score depends to some extent on pure
luck, the luck being either good or bad depending on the child, the
test, and the day on which he is tested. If a child scores at the 70th
percentile on a particular test on a particular day, it is wrong to think
of this as an exact indication of his ability; rather, one should
consider his score as somewhere close to the 70th percentile, maybe
higher or maybe lower. Because of the chance factors (unreliability)
that are present in all test scores, the rule above receives added support:
the results from two different tests given on different days are
better than one.

Normative Scores. Both percentiles and standard scores (or transformed
standard scores) are excellent ways in which to express norms.
Both indicate where the child stands relative to children in the norma- 
tive group, and both are easy to interpret. Uses of grade norms and 
age norms do not introduce new types of scores, but rather they 
determine the normative groups over which percentiles and standard 
scores are to be computed. In this way, for example, percentile scores 
for a test are determined with respect to children in the sixth grade, 
or standard scores are determined with respect to eight-year-olds. 

In contrast to percentiles and standard scores, the quotient scores 
described previously definitely are not recommended. One of their 
largest faults is that the same quotient seldom has the same meaning 
from grade to grade or age to age. This is because in order for 
quotients to have the same meaning at different levels (ages or 
grades), it is necessary for quotient scores to have the same standard 
deviations at different levels. For example, suppose that a five-year-old 
and a ten-year-old both have EQs of 120. It is easy to jump to the 
conclusion that they are equally advanced with respect to their educa- 
tional accomplishment. This is a correct conclusion only if the standard 
deviations of EQs are the same for five-year-olds and ten-year-olds. 
Suppose that the standard deviation for five-year-olds is 10. Then the 
five-year-old child is two standard deviations above his age group, 
approximately at the 98th percentile. In contrast, suppose that the 
standard deviation of EQs for ten-year-olds is 20 instead of 10. Then
the ten-year-old child is only one standard deviation above his age 
group, approximately at the 84th percentile. 
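The two percentiles in this example can be checked with a short sketch, assuming that EQs in each age group are normally distributed about a mean of 100 (an assumption consistent with the figures quoted above).

```python
from statistics import NormalDist

def percentile_of_quotient(quotient, sd, mean=100):
    """Percentile of a quotient score in a normal distribution (assumed)."""
    return 100 * NormalDist(mean, sd).cdf(quotient)

five_year_old = percentile_of_quotient(120, sd=10)   # two SDs above the mean
ten_year_old = percentile_of_quotient(120, sd=20)    # one SD above the mean

print(round(five_year_old), round(ten_year_old))
```

The same EQ of 120 thus stands at about the 98th percentile in one age group and about the 84th in the other.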

In constructing and standardizing a test, it is nearly impossible to 
equate standard deviations of quotient scores at different age and 
grade levels. Consequently, misinterpretations are sometimes made of
particular quotient scores. Also, quotient scores on different tests often
have different meanings because of differences in standard deviations.
If one child manifests an IQ of 125 on a test and another child manifests
an IQ of 130 on another test, it may be the case that the first
child actually would have a higher score expressed as either percentiles
or standard scores. Far better than to use either the EQ or the
IQ (as obtained from quotient scores) is to express school perform- 
ance and intelligence as percentiles and/or standard scores. (On the 
more recently developed intelligence tests, the so-called “IQ” is really 
a transformed standard score rather than a ratio of MA to CA. This im- 
provement will be discussed more fully in Chapter 12.) Because of the 
many difficulties with the EQ and the IQ when they are expressed as 
quotient scores, it should be obvious that the AQ (which is the quotient 
of EQ over IQ) is even more unreliable and difficult to interpret. 

Achievement and Intelligence. Regardless of whether achievement 
and intelligence are expressed by quotient scores (EQ and IQ) or by
percentiles and standard scores, there are some dangers in trying to
contrast the two. Achievement tests are intended to assess progress in
school. Intelligence tests are supposed to measure capacity, or apti- 
tude—the propensity to learn, which may or may not have been evi- 
denced in actual achievement in school. Although the distinction is 
theoretically useful, at the present time it is not possible to measure 
capacity separately from achievement up to a particular point in time. 
Achievement tests overlap intelligence tests so much in content and 
correlate so highly in practice that it is difficult to distinguish the two. 
There are exceptions—the child who has a high IQ and who is failing 
in school because of health problems or maladjustment. But such ex- 
ceptions do not occur nearly as often and not nearly to the same extent 
as is generally thought. 

Most apparent differences between achievement and intelligence are 
due to measurement error—chance factors which make the child score 
higher on one than the other. For example, if a child has a percentile 
score of 80 on an achievement test and 90 on an intelligence test, it 
would be highly dangerous to conclude that he is not working up to 
capacity. If he were given two different tests on a different occasion, 
luck might have it that he would appear at the 90th percentile on 
achievement and at the 80th percentile on intelligence. (This important
point will be pursued more fully in later chapters.)


The Teacher’s Use of Norms


What will the teacher use in interpreting the results of the midsemester
examination in ancient history? He probably will not employ
any of the elaborate methods of analysis which have been mentioned,
such as converting scores to percentiles and standard scores. He usually
does not have the time, and if he did, there probably would not
be enough test items or students to make such statistical results meaningful.
He might go to the trouble to compute the mean and the standard
deviation.

Teachers are sometimes called on to analyze the results from tests
given to large segments of a student body, such as a final examination
in mathematics which is given to over a hundred students in a half- 
dozen or more sections. In such cases, it would be appropriate and 
worth the effort to compute percentiles and/or standard scores. 

The principal need for a knowledge of scores and norms is in inter- 
preting the results from commercially distributed tests of achieve- 
ment, intelligence, personality, interests, and others. There the teacher 
will see percentiles, transformed standard scores, educational quo- 
tients, and others. Without understanding these, the teacher can make 
some very bad mistakes in interpreting test results. To enhance that
understanding has been the major purpose of this chapter. 




Summary 


Raw scores obtained on achievement tests, teacher-made tests, and 
others are not directly meaningful until they are compared with two 
types of standards. The first type of standard is what the teacher ex- 
pects of his students, a discussion of which was reserved for later chap- 
ters. A second type of standard is had by comparing the scores of stu- 
dents with those of other students. This chapter was devoted largely 
to discussing the procedures that are useful in applying this second 
type of standard. 

In order to understand how well students do with respect to one 
another, the first need is a measure of central tendency. It was said 
that the arithmetic mean is the preferred measure of central tendency 
in most situations. Second to obtaining a measure of central tendency 
is to obtain a measure of the spread, or dispersion, of scores. The 
standard deviation was shown to be very useful for that purpose. Be- 
cause most distributions of test scores resemble the normal distribu- 
tion, many useful properties of the normal distribution can be bor- 
rowed. Two important advantages of converting raw scores to stand- 
ard scores are (a) it provides a simple way of averaging the scores 
obtained from different tests, and (b) it provides an estimate of the 
per cents of students who lie in various regions of the test continuum. 

Complementary to the use of standard scores is to rank the raw 
scores of students and convert these to percentiles. Both indicate how 
well students perform with respect to their classmates or with respect 
to larger groups of students. 

For studying the results of their own tests, teachers will seldom 
find the need to compute means, standard scores, percentiles, and 
other such statistics. However, in order to compare the progress of
their students with larger groups of students—throughout a school 
system, a geographic region, or the nation as a whole—it is necessary 
to use such transformations of raw scores and necessary for teachers
to understand some simple principles concerning scores, norms, and
statistics.


Suggested Additional Readings 


Anastasi, Anne. Psychological testing. (2nd ed.) New York: Macmillan, 1961,
chap. 4.

Blommers, P., and Lindquist, E. F. Elementary statistical methods in psychology
and education. Boston: Houghton Mifflin, 1960.

Flanagan, J. C. Units, scores, and norms. In E. F. Lindquist (Ed.), Educational
measurement. Washington: American Council on Education, 1951, chap. 17.

Garrett, H. E. Elementary statistics. New York: Longmans, 1956.

Seashore, H. G. Methods of expressing test scores. Test Serv. Bull. No. 48. New
York: Psychological Corporation, 1955.


chapter 


Correlation 
and Reliability 


One of the most important features of any test is its reliability. A test 
is reliable if it provides highly precise indications of students’ stand- 
ings with respect to one another; if a test is not highly reliable, a zone 
of uncertainty must be considered in interpreting particular scores. 
For example, if on a highly reliable achievement test it is found that 
a student stands at the 80th percentile, this can, with some confidence, 
be taken as a rather exact indication of his standing. If the test were 
not highly reliable, then the score would have to be interpreted with 
caution, and the possibility would have to be considered that the stu- 
dent’s real standing was either considerably higher or considerably 
lower than that shown by the test. Obviously it is desirable for all 
educational tests to have a high degree of precision (be highly re- 
liable); therefore, test reliability is one of the most important topics 
to be discussed in this book. 

It is not possible to fully understand test reliability until some basic 
principles of correlational analysis are grasped. One of the most im- 
portant statistical problems is to determine the extent to which two 
tests are related. For this purpose a statistical measure is available 
which is called the correlation coefficient, which will be discussed in 
this chapter. 

Test reliability concerns the extent to which test results are repeatable.
As was mentioned previously, some luck is involved in the results
from any particular test given on any particular day, resulting in
some students scoring lower than they should and other students scoring
higher than they should. To the extent to which particular results
are due to luck, different scores should be obtained on a different occasion.
By administering the same test on two occasions, or by administering
two similar tests on two occasions, the extent to which
luck (measurement error) influences scores can be determined. If the
test results were entirely due to chance (say, coins were tossed to determine
grades on both occasions), there would be practically no relationship
between the two tests. If instead, two long, carefully standardized
tests (say, spelling tests) were used, luck would not be an important 
factor, and high correspondence would be found between the two 
tests. The correlation coefficient is used to measure the amount of cor- 
respondence, and when it is used in this way, it is called the reliability 
coefficient. 


Correlational Analysis 


Mr. Martin is comparing the results of the first test in ancient history 
with the results of the second test. He wants to know how well the 
two sets of scores correspond. He can see that many of the students 
have done relatively as well or as poorly. Fred made the top score on 
the first test, and he is near the top on the second. Poor Jimmy, he 
was on the bottom of the heap both times. But here is an exception: 
Joan was only average the first time, but she is near the top on the 
second test. Another student really fell down. He was above average on the first
test, and now he is barely above Jimmy. Because of the number of 
students in the class, and because in some cases agreement is good 
and in other cases scores changed markedly, it is difficult to summarize
the over-all degree of correspondence between the two sets
of scores. In this case, Mr. Martin would find correlational analysis 
very helpful. 

Computation. In determining the amount of correspondence between
the two sets of scores, Mr. Martin quickly grasps one of the
fundamentals: the absolute size of the scores on the two occasions is
irrelevant. There were twice as many questions on the second test;
therefore, scores are generally larger on the second occasion. Jimmy
got 22 multiple-choice items correct on the first test and 46 on the
second test, but he was at the bottom of the list both times. The most
important consideration is the relative ordering on the two tests. Consequently,
to understand the correspondence, it would help if all the
scores were converted to ranks or to some other type of relative scores
such as percentiles or standard scores. It happens that it is mathematically
most convenient to start with standard scores, as shown in
the table below. The two sets of standard scores show, for example,
that student A made the highest score on the first test and next
to the highest score on the second test. Student E was exactly at the
mean on the first test and above the mean (.59 SDs) on the second
test.

Student     First test standard scores     Second test standard scores
A                     1.55                            1.18
B                     1.16                            1.77
C                      .77                             .59
D                      .39                           -1.18
E                      .00                             .59
F                     -.39                            -.59
G                     -.77                            -.59
H                    -1.16                            -.59
I                    -1.55                           -1.18

To further illustrate the correlation problem, in Figure 4-1 a graphic
comparison is made of the two sets of scores. Here it can be seen that
there is a strong correspondence between the two tests. The points
tend to group about the best-fit line which is shown in the figure.
(How such best-fit lines are obtained is explained in Appendix C-2.)
Most of the points fall on or near the line, but there are definite exceptions.

Figure 4-1. Comparison of standard scores on two tests for ancient history.
(Scatter plot of the students' standard scores on the two tests, with the best-fit line.)

After standard scores are obtained, it is very easy to compute the
correlation coefficient (symbolized as r). First, multiply the two standard
scores for each student. Add these, and divide by the number of
students. The result is r, the correlation coefficient. The computational
steps are shown as follows:


 1.55 ×  1.18 =  1.83
 1.16 ×  1.77 =  2.05
  .77 ×   .59 =   .45
  .39 × -1.18 =  -.46
  .00 ×   .59 =   .00
 -.39 ×  -.59 =   .23
 -.77 ×  -.59 =   .45
-1.16 ×  -.59 =   .68
-1.55 × -1.18 =  1.83
          Sum =  7.06

$$r = \frac{\text{sum}}{\text{no. of students}} = \frac{7.06}{9} = .78$$


The correlation coefficient r is the average product of standard 
scores. For this reason, it is often referred to as the product-moment 
coefficient (“moment” meaning standard score in this case). In prac- 
tice, it would be excessively laborious to first obtain standard scores 
before computing the correlation. Consequently, numerous formulas
are available for computing the coefficient from either raw scores or 
deviation scores. Several of these are described in Appendix C. It 
should be emphasized that there is only one product-moment co- 
efficient corresponding to any set of paired scores. The different 
formulas one sees in texts on statistics and testing are different com- 
putational approaches to the same value. All the product-moment 
formulas supply the same result.
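
The point that the different formulas are only different routes to one value can be illustrated with a short Python sketch. The raw scores and function names below are hypothetical, and the raw-score formula is the common textbook form; it is assumed, not known, to match the ones given in Appendix C.

```python
from math import sqrt

def r_from_standard_scores(x, y):
    """Average product of standard scores (the definition used in the text)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    z_x = [(v - mean_x) / sd_x for v in x]
    z_y = [(v - mean_y) / sd_y for v in y]
    return sum(a * b for a, b in zip(z_x, z_y)) / n

def r_from_raw_scores(x, y):
    """Raw-score formula: no means or standard scores are computed first."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Hypothetical raw scores for nine students on two tests.
first_raw = [52, 47, 60, 38, 55, 41, 49, 58, 44]
second_raw = [70, 66, 81, 59, 68, 63, 72, 75, 57]

r_standard = r_from_standard_scores(first_raw, second_raw)
r_raw = r_from_raw_scores(first_raw, second_raw)
print(f"from standard scores: {r_standard:.4f}")
print(f"from raw scores:      {r_raw:.4f}")
```

The two printed values agree except for negligible rounding in the machine arithmetic, which is all that is meant by saying there is only one product-moment coefficient for a set of paired scores.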

Interpretation. Is a correlation of .78 good or bad? What is a good 
relationship depends on the situation in which the correlation co- 
efficient is being used. Correlations range from 1.00 through zero to 
—1.00. A correlation of 1.00 means a perfect relationship—students 
are rank-ordered in exactly the same way on both tests. Correlations 
between zero and 1.00 indicate varying degrees of correspondence. A 
correlation of zero means that there is only a chance relationship be- 
tween the two tests. Some of the students who score high on the first 
test score high on the second, and others score low on the second. A
zero correlation means that the first test supplies no information at all
as to how well students will do on the second test. As good an estimate
can be made by tossing coins for each student.




A negative correlation means that there is an inverse relationship 
between the two tests. Students who rank high on the first test tend to 
rank low on the second test and vice versa for students who rank low 
on the first test. A good example of a negative correlation would be 
between school grades and days absent from school. Generally it 
would be expected that students who were absent a considerable num- 
ber of days would, on the average, make somewhat lower grades than 
those who seldom were absent. A correlation of —1.00 means a per- 
fect inverse relationship. Correlations between zero and —1.00 indicate 
varying degrees of inverse relationship. 

One way to think about the meaning of the correlation coefficient
is in terms of the amount of “scatter” shown when the two sets of
scores are presented graphically, as in Figure 4-1. Although it was
convenient to use the scores of only nine students to illustrate the
correlation problem, actually it would be rather meaningless to apply
correlational analysis to so small a group. In most important
correlational problems there are at least one hundred students involved
and in some studies more than one thousand. In a graphic presentation,
the two scores of each student are represented by a point. Because the
points tend to “scatter” over the graph, the graphic presentation is
referred to as a scatter diagram.

Figure 4-2 shows a typical scatter diagram relating the scores on an
English achievement test to scores on a mathematics achievement test.
If there were a perfect correlation between the two tests, all the points
would fall exactly on the best-fit line (see Appendix C-2), and there
would be no scatter about the line. The more the scatter, the less the
correlation between the two tests. Figure 4-2 shows the amount of
scatter when the correlation is .67. Even though it is a relatively high
correlation in terms of what is often found, it can be seen that some
students are real exceptions to the general trend of correspondence.
Some students do very well in English and only average in mathematics.
Some are only average in English and superior in mathematics.

The wider the scatter of points about the best-fit line, the lower the
correlation. When the points tend to pack tightly about the line, the
correlation is high. When the points scatter all over the graph and
there is no visible trend of correspondence, the correlation is near zero.

A negative correlation has the same implications regarding amount
of scatter as does a positive correlation of the same size. The only
difference is that the best-fit line slants downward, going from left to
right on the graph, rather than upward as when a positive correlation
is present. In many cases the sign of the correlation is arbitrarily
determined by the way in which tests are scored. If, for example, number
of errors is scored in spelling and number of correct answers is
scored in mathematics, a negative correlation is to be expected. Then,
by reframing the problem as one of relating accuracy in spelling
(rather than errors) to accuracy in mathematics, the sign of the
correlation can be reversed, e.g., a correlation of -.72 would become .72.
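
The reversal of sign under rescoring can be verified with a small Python sketch. Everything in it is hypothetical: the scores, the length of the spelling test, and the helper function are invented for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation for two lists of paired scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n * sd_x * sd_y)

# Hypothetical data: a 20-word spelling test scored as number of errors,
# and a mathematics test scored as number of correct answers.
WORDS_ON_TEST = 20
spelling_errors = [2, 5, 9, 4, 12, 7, 3, 10]
math_correct = [38, 30, 27, 35, 22, 31, 36, 24]

r_errors = pearson_r(spelling_errors, math_correct)

# Rescore spelling as number of words spelled correctly.
spelling_correct = [WORDS_ON_TEST - e for e in spelling_errors]
r_correct = pearson_r(spelling_correct, math_correct)

print(f"errors vs. mathematics:  r = {r_errors:.2f}")
print(f"correct vs. mathematics: r = {r_correct:.2f}")
```

The two coefficients have the same size and opposite signs, since rescoring only turns the spelling scale end for end.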

In addition to serving as a very useful index of correspondence,
correlational analysis provides a number of other important statistics.
Already mentioned was the best-fit line, which summarizes the trend of
relationship. In addition, estimates can be made of the amount of error
entailed in forecasting scores on one test from another. Some statistics
relating to correlational analysis are discussed in Appendix C-2.

Figure 4-2. Scatter diagram of scores on English and mathematics
achievement tests. (Horizontal axis: English score.)

The correlation coefficient is used for many purposes in addition to 
that of measuring the correspondence between two tests. One important
use is to determine the predictive validity of aptitude tests. In a


program of study in high school. Because the test is being used to
serve a prediction function, the validity with which that function is
served is determined by correlating test results with a criterion, which
in this case would probably be grade-point averages earned in the
special curriculum. Of course, it would be necessary to wait one or
more years after the test was administered in order to obtain the
grade-point averages. Each student then has a pair of scores, one on the
predictor test and a grade-point average. The correlation formulas
(either the one given previously in the text or the ones given in
Appendix C-2) can be applied.

Another important use of the correlation coefficient is in educational 
research. As a simple example, it might be desired to study the cor- 
respondence between muscular coordination and intelligence. Scores 
from a coordination test could be correlated with scores from an in- 
telligence test. Another example would be to correlate school grades 
with amount of time students study. Thousands of such correlations 
have been computed in order to learn about the educational process. 

Returning now to the primary question of “What is a good correlation?”
the best guide is to compare correlations with those which usually
are obtained when carefully composed tests are used. Correlations
between tests given at different times in the same class, e.g., ancient
history, would be expected to be at least .50 but probably not higher
than .80. Correlations between achievement test results, say for fourth
graders, and intelligence test scores run about .75 or higher. Two
forms of a carefully constructed, commercially distributed test (such
as two forms of an intelligence test) would probably correlate at or
above .90.

Correlations between predictor tests and their criteria usually are
lower than those reported above. A typical reading readiness test
given to entering first-grade students would be expected to correlate
about .60 with reading achievement grades at the end of the first
grade. Intelligence tests and comprehensive achievement tests used to
select students for honors programs and other accelerated curricula in
high schools would be expected to correlate around .70 with over-all
grade-point averages. Scholastic aptitude tests used to select students
for colleges would be expected to correlate around .55 with grade
averages in college. Personality tests tend to correlate much less well
with school criteria than do ability tests. For example, the test
constructor would probably be very happy to find a correlation of .30
between a test of social adjustment and school grades or teachers’
ratings of adjustment.

Correlations found in educational research tend to be even smaller
than those found in measuring predictive validity. For example, one
would expect only rather low correlations, if any, between muscular
coordination and intelligence and between grades and amount of time
spent in study. Such variables are quite complex, and it is unreasonable
to expect high correlations among them. The major question at
issue is whether there is a perceptible correlation. In this context,
correlations as low as .20 often are quite interesting, and correlations
as high as .30 or .40 are sometimes considered major findings.

Definite exceptions are sometimes encountered to the rules above— 
correlations of near zero between two tests in ancient history, predic- 
tive validity as high as .80, and correlations in educational research 
much higher than those quoted. But these are exceptions, and the 
fact that they are recognized as exceptions when they occur is owing 
to the sizes of correlations which typically are found. 

Returning finally to Mr. Martin, we can tell him that a correlation 
of .78 between his two tests represents a relatively high degree of
correspondence. The correlation is as high as, or higher than, that
typically found in such situations.

Cautions in Using Correlation. The correlation coefficient is not 
necessarily a measure of causation. For example, there is a positive 
correlation between the number of books in homes and the grades of 
students. It would be wrong to conclude from this that students make 
better grades because books are in the home. In well-to-do homes 
there are more books, typewriters, baths, golf clubs—more everything. 
Poor students would probably not make better grades (or not much 
better) if books were placed in their homes, and good students would 
not lower their grades appreciably if the books were removed from 
their homes. But even if correlations do not necessarily measure
causation, they provide many clues to causative connections (using the
word “cause” to mean necessary and sufficient antecedent conditions,
e.g., to bring about good grades).

Like any other statistic, the correlation coefficient computed on a
particular set of scores is only an estimate of the “real” correlation.
Suppose, for example, that you want to learn the correlation between
the height and weight of all eighth-grade students in the United States.
You would probably not measure the height and weight of that many
students. Instead you would probably measure a sample of them, say
5,000 students from schools across the country. The correlation found
in the sample is of no importance in its own right. It is important only
to the extent that it is an accurate estimate of the correlation that
would be found if all the eighth graders were measured and the
correlation computed.

Previously it was said that the two important characteristics of any
sample are that it be unbiased and that sufficient numbers of students
be involved. If the sample is unbiased, the precision with which the
sample correlation estimates the real (population) correlation varies
with the number of students involved. Correlations based on 100
students provide more accurate estimates than those based on 10
students. Correlations based on 1,000 students would provide much
better estimates. When the sample is as large as 10,000, estimates are so
accurate as to be exactly correct for all practical purposes.

Any particular correlation should not be considered as an exact 
point but rather as a range extending above and below the sample 
value. If, for example, a correlation of .50 is found in a sample of 100
students, the proper view is that the real correlation lies in a region 
centering on .50, perhaps being as high as .70 or as low as .30. Such a 
region is referred to as a confidence band, a region in which we feel 
confident that the real (population) correlation lies. For this purpose, 
a 95 per cent confidence band is often employed, which means essen- 
tially that the odds are 95 out of 100 that the population correlation 
is somewhere within the band. To explain how such confidence bands 
are obtained would go beyond the scope of this book. (The interested 
reader should consult the statistics texts listed in the Suggested Addi- 
tional Readings at the end of the chapter. Some elementary concepts 
are discussed in Appendix C-6.) The reader can, however, grasp the 
idea of confidence bands and see why correlations must be viewed in 
this way.

An example will help show how confidence bands are used. With a
sample of 100 and a sample correlation of .50, the band extends from
.34 to .63. Another way of saying it is that the odds are 95 out of 100
that the real value lies somewhere between .34 and .63. (The
statistically sophisticated reader will note that the phraseology does
not impute the precise statistical meaning, but it is correct for all
practical purposes.) As the sample size is increased, the confidence band
shrinks, which is another way of saying that we can have more
confidence in correlations obtained from larger samples.

What happens if the confidence band covers zero? For example,
with a sample of 20 and a correlation of .25, the 95 per cent confidence
band extends from -.22 to .62, which means that the population
correlation might be zero or even negative. When the confidence band
covers zero, there is no reason to believe that the population
correlation is other than zero. To demonstrate that the correlation is
other than zero would require a larger sample. In many correlational
studies, the sample is small, and the confidence band is so large that
the results are inconclusive.
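
Although the derivation is left to the statistics texts, the reader with access to a computer can reproduce the bands quoted above. The Python sketch below uses one common approximation, Fisher's z transformation. The text does not say which method produced its figures, so the choice of method here is an assumption, as are the function and variable names; the approximation does, however, give the same two-decimal limits as the examples in the text.

```python
from math import atanh, sqrt, tanh

def confidence_band(r, n, z_crit=1.96):
    """Approximate confidence band for a sample correlation.

    Uses Fisher's z transformation. A z_crit of 1.96 corresponds to a
    95 per cent band.
    """
    z = atanh(r)               # transform r to Fisher's z
    se = 1.0 / sqrt(n - 3)     # approximate standard error of z
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# The two examples discussed in the text.
for r, n in [(0.50, 100), (0.25, 20)]:
    low, high = confidence_band(r, n)
    print(f"r = {r:.2f}, n = {n}: band from {low:.2f} to {high:.2f}")

# The band shrinks as the sample size is increased (r held at .50).
for n in (10, 100, 1000, 10000):
    low, high = confidence_band(0.50, n)
    print(f"n = {n:>5}: width of band = {high - low:.3f}")
```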

The reader who has not studied statistics will, of course, not
understand the intricacies of how confidence bands are generated. This is
not the important consideration here. The important points are to
think of correlations obtained from samples as representing bands
rather than exact points and to compute such bands using formulas
presented in statistical texts. The reader who has a special interest in
the sampling error of correlation coefficients should consult the statis- 
tics texts listed in the Suggested Additional Readings. 




Reliability of Measurements 


As was mentioned previously, some luck or chance is involved in all 
test scores. To the extent to which such chance factors predominate, a 
test is said to be unreliable. Conversely, when the influence of chance 
factors is slight, a test is said to be highly reliable. The chance ele- 
ment in test scores is referred to as measurement error. Measurement 
error is always bad in the sense that it tends to obscure the actual 
abilities of students. We cannot observe measurement error directly, 
but we can indirectly observe its effect, which is to make measure- 
ments inconsistent from occasion to occasion. That is, if a particular 
student happens to be lucky on a particular day on a particular test, 
he is not likely to be so lucky on another day taking another test. 
Similarly, the student who is somewhat unlucky on the first occasion 
is not likely to have such bad luck the second time. Consequently, by
comparing the scores made on two occasions and measuring the rela- 
tive degree of consistency, we can indirectly determine the amount of 
measurement error involved in the particular type of instrument. 

It should be obvious that if measures are completely inconsistent, 
then they are worthless. For example, if Mr. Martin were to give two 
final examinations in ancient history rather than one and these were 
separated by an interval of several days, the correlation between the 
two tests would indicate the reliability. If the two tests correlate zero
or near zero, then something is definitely wrong. Susie Smith would
have received an A if the first test had been counted but a C if the
second test had been counted instead. In contrast, Bill Jones would
have failed the course if the first test had been counted but would
have received a B if the second test had been counted. Then
either one, or more likely both, of the tests are completely dominated
by measurement error and are, consequently, worthless. Using
either of the tests would be no better than flipping coins to determine
grades.

A zero correlation between two such classroom tests practically 
never occurs in practice, but it is often the case that two such tests do 
not correlate highly, meaning that a sizable portion of measurement 
error is present. As was mentioned previously, reliability coefficients
obtained from many of the carefully constructed, commercially
distributed tests are .90 or higher. Although such careful test
construction efforts are not possible for most classroom tests, one would
expect reliability coefficients for final examinations to be at least as
high as .75, and preferably higher than .85. To the extent to which
reliability coefficients are lower, say as low as .50, it means that a very
poor test is being used. Some of the sources of such measurement
error will be discussed in a following section. 




True Scores. In discussing test reliability, it is useful to think in 
terms of hypothetical true scores, the scores that people would make
if tests were perfectly reliable, that is, if no chance factors were involved.
There is no direct way to measure such true scores, but an approxima- 
tion is available. If we administered many similar forms of a test on 
many different days, the average score would closely approximate an 
individual’s true score. Over these many occasions, chance factors 
would tend to average out. The more tests averaged, the less would 
be the measurement error. This is why it was said previously that two 
tests are better than one, and three or four, or even more, tests would 
be better still, if it were not for the practical difficulties involved in 
employing so many tests. 
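
The way chance factors average out can be seen in a small simulation, sketched here in Python. All of the numbers are hypothetical (the number of students, the spread of true scores, and the size of the chance factors), and the errors are assumed to be normally distributed and independent from form to form.

```python
import random
from math import sqrt

rng = random.Random(0)  # fixed seed so the sketch is repeatable

N_STUDENTS = 2000
ERROR_SD = 5.0  # hypothetical spread of chance factors, in score points

# Hypothetical true scores for a group of students.
true_scores = [rng.gauss(50, 10) for _ in range(N_STUDENTS)]

def rms_error(n_forms):
    """Root-mean-square gap between each student's true score and the
    average of n_forms obtained scores."""
    total = 0.0
    for t in true_scores:
        obtained = [t + rng.gauss(0, ERROR_SD) for _ in range(n_forms)]
        average = sum(obtained) / n_forms
        total += (average - t) ** 2
    return sqrt(total / N_STUDENTS)

errors = {k: rms_error(k) for k in (1, 2, 4, 16)}
for k, e in errors.items():
    print(f"{k:>2} form(s) averaged: typical error = {e:.2f} points")
```

Under these assumptions the typical error falls steadily as more forms are averaged, which is the sense in which two tests are better than one.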

The scores obtained on the different occasions would tend to range 
about the true score, luck having it that, on some days and on some 
forms, the student would score higher than his true score, and on other 
days and other forms, the student would score lower than his true 
score. Because all such errors tend to be normally distributed, the
expectation is that the obtained scores would be normally distributed
about the true score, which is illustrated in Figure 4-3. 


Figure 4-3. True scores and distributions of obtained scores for two
persons. (Horizontal axis: trait continuum, from low to high. For each
person, the obtained scores are dispersed about that person's true score.)


Figure 4-3 shows the hypothetical true scores and distributions of
obtained scores for two students. Student A has a relatively high true
score, and student B has a relatively low true score. Of course, it
would be totally impractical to give enough alternate forms to
demonstrate the normal distributions of obtained scores indicated in the
figure. However, the normal distributions are approximately what
would be expected. In the figure, the normal distributions of obtained
scores are shown to be the same width, i.e., they have the same
standard deviation. The standard deviation of the distribution of obtained
scores is a direct measure of the amount of measurement error or
unreliability present. If the standard deviation is large, the measure is
relatively unreliable. Conversely, if the standard deviation is small,
the measure is relatively reliable. Although it is possible, and
sometimes likely, that the standard deviations of obtained scores would be
different for different students, in most practical work it is assumed
that all the standard deviations of errors are approximately the same.
Consequently, a typical standard deviation is used to describe the
errors likely to be shown by all students. This typical measure of
error is referred to as the standard error of measurement. When the
reliability is low, the standard error of measurement is large, and vice
versa when the reliability is high. In a perfectly reliable test, the
standard error of measurement would be zero. Obviously a large
standard error of measurement is “bad” because it reduces the
precision with which the test can be used for any purpose. Although there
is no direct way to compute the standard error of measurement, there
is an indirect way to estimate it, which is described in Appendix C-4.
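
As a rough illustration of the relation between reliability and the standard error of measurement, the Python sketch below uses the conventional estimate, the standard deviation of obtained scores multiplied by the square root of one minus the reliability coefficient. That this is the method described in Appendix C-4 is an assumption, and the standard deviation of 10 and the function name are hypothetical.

```python
from math import sqrt

def standard_error_of_measurement(sd_obtained, reliability):
    """Conventional estimate: SD of obtained scores times sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd_obtained * sqrt(1.0 - reliability)

SD_OBTAINED = 10.0  # hypothetical standard deviation of test scores

for reliability in (0.50, 0.75, 0.90, 1.00):
    sem = standard_error_of_measurement(SD_OBTAINED, reliability)
    print(f"reliability {reliability:.2f}: "
          f"standard error of measurement = {sem:.2f}")
```

The estimate shrinks as the reliability rises and reaches zero for a perfectly reliable test, in keeping with the description above.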

Interpretation of Obtained Scores. A score obtained on a particu- 
lar test on a particular day should not be considered as an exact point, 
but rather it should be thought of as representing a zone in which the 
student's true ability lies. That is, if a particular student scores at the 
60th percentile, it is wrong to conclude that this is his exact standing. 
It is more appropriate to say that his true score lies somewhat near 
the 60th percentile, how near depending on the reliability of the test. 
Some methods for determining the reliability will be discussed in a 
later section. If the reliability is low and, consequently, the standard 
error of measurement is large, then a very broad band must be
considered. In that case, the student's true score might actually be as high
as the 90th percentile or much below the 50th percentile. When the 
reliability is high and the standard error of measurement is small, a 
much narrower band of error needs to be considered. Then, we can 
confidently interpret the student’s true score as lying somewhere in a 
band ranging from, say, near the 50th percentile to near the 70th
percentile. Even in the most reliable test, there is a band of error that
must be considered.


A very important point to comprehend, one that is much misunder- 
stood, is that scores obtained on any particular test, on any particular 
day, are somewhat biased. High scores are too high, and low scores
are too low. In looking at the results of any particular test, we are not 
only witnessing the abilities of the students, we are witnessing the 
relative amounts of luck that they had. As a group, those who made 
very high scores are not only high in ability, but they were also
somewhat lucky on that particular occasion. Conversely, as a group, the
students who made very low scores are not only low in ability, but 
they also had some bad luck. If an alternate form of the test is ad- 
ministered on another day, the students who scored very high the first 
time will, as a group, come down somewhat. The students who scored
very low the first time will, as a group, come up somewhat. This effect
is referred to as regression toward the mean; i.e., the people who were
very far from the mean on the first occasion, either very much above
or very much below, will, as groups, tend to regress, or move toward,
the average value. A moment's reflection will show that there is
nothing else that could happen. If a student is at the 99th percentile on
the first occasion, he cannot possibly score any higher, relative to his 
classmates, on an alternate form administered on another day. He can 
either remain at his high level or score lower. Similarly, a student who 
is at the zero percentile on the first occasion cannot go any lower the 
second time. He can either move upward in his standing or remain at 
the bottom of the heap. This tendency to regress toward the mean 
occurs to some extent for students at all levels. Students who score at 
the 90th percentile, will, as a group, tend to score slightly lower on the 
second occasion, and similarly for people at the 80th, 70th, and 60th 
percentiles. Students who score below the mean at the 10th, 20th, 30th, 
and 40th percentiles, will, as groups, tend to move up toward the mean 
on the second occasion. 

The principle of regression toward the mean does not necessarily 
occur for each person. One student who scores at the 90th percentile 
on the first occasion may actually go up on the second occasion to, say, 
the 95th percentile. A student who scores at the 10th percentile on the 
first occasion may actually go down. However, it should be obvious 
that there is not as much room for the former student to go up on the 
second occasion, and not as much room for the latter student to go
down on the second occasion. All scores above the mean tend to be
somewhat biased upward; i.e., they are probably higher than they
should be. Scores below the mean tend to be biased downward.

In percentile terms, or in terms of standard scores, the further an 
obtained score is from the mean, the more it is likely to be biased. For 
example, if we took a rather extreme group of students, say, all those 
who score exactly at the 90th percentile, the odds are that their average 
score on an alternate form would place them at, say, the 80th per- 
centile, depending on the reliability of the test. In contrast, a much less 
extreme group, for instance, all those lying at the 70th percentile,
would, in percentile units, regress less toward the mean on the second
occasion. For example, we might find that, as a group, they would show
an average percentile of 65 on the second occasion, regressing toward
the mean by only half as much as the more extreme group.

The less reliable the test, the more scores tend to be biased. For
example, if test scores were determined by flipping coins on the first
occasion and the alternate form consisted of flipping coins on another
day, the first set of scores would be completely biased. Even if coins
were used to determine scores, some students would lie at the 90th
percentile, others at the 80th, and so on. For students who were at the
90th percentile on the first occasion, the best bet is that, as a group,
they will average out at the 50th percentile on the second occasion,
there being no reason for them to score either above or below average
on this chance test. The more reliable the test, the less scores tend to
be biased.
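
The dependence of the bias on reliability can be sketched numerically in Python. The sketch assumes normally distributed scores and uses the usual rule that a group's expected standard score on the second occasion equals the reliability times its standard score on the first occasion. The rule is assumed here rather than taken from the text, the function name is invented, and the reliability of .66 is a hypothetical value chosen so that the 90th-percentile group lands near the 80th percentile, as in the illustration above.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def expected_retest_percentile(first_percentile, reliability):
    """Percentile matching a group's expected standard score on an
    alternate form, given its percentile on the first form.

    Assumes normal scores and: expected second standard score
    = reliability x first standard score.
    """
    z_first = STANDARD_NORMAL.inv_cdf(first_percentile / 100)
    z_second = reliability * z_first
    return 100 * STANDARD_NORMAL.cdf(z_second)

for reliability in (0.0, 0.66, 0.90, 1.0):
    for first in (90, 70):
        second = expected_retest_percentile(first, reliability)
        print(f"reliability {reliability:.2f}: {first}th percentile -> "
              f"about the {second:.0f}th on the second occasion")
```

With a reliability of zero, the coin-flipping case, every group is expected at the 50th percentile on the second occasion; with a reliability of 1.00 no regression occurs at all.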

Regardless of how high or low the reliability, some element of bias 
is present. Only the scores of students who score exactly at the mean 
are completely unbiased. There is no reason to believe that they would, 
as a group, go either upward or downward on the second occasion. 
Scores which are relatively close to the mean embody relatively little 
bias. On most of the well-constructed, commercially distributed tests,
where the reliability is high, the bias is slight. Between the 20th and 
80th percentiles, the bias is so slight that it can be ignored for all 
practical purposes. Beyond those extremes, the bias is not inconsidera- 
ble and should be taken into account in interpreting test scores. 

There are two ways to take the bias into account in interpreting test 
results. First, statistical corrections can be made for the probable bias. 
Methods for doing this are discussed in Appendix C-4. More important 
for the teacher is to remember that extreme test scores tend to be 
biased, even on the best commercially distributed tests, and to take 
this into account when interpreting test scores. For example, after an
achievement test is administered to fourth graders, the teacher should 
remember the probable bias of extreme scores and make interpretations 
accordingly. If she sees that Johnny is at the 4th percentile on the test, 
she should say to herself, he is probably doing poorly, but the odds are 
that he has slightly more ability than indicated on the test. (As was 
mentioned previously, there is some likelihood that Johnny has even 
less ability than indicated, but the odds are that his real ability is 
slightly higher than shown on the test.) Conversely, if Susie has a score 
at the 98th percentile on the achievement test, the teacher should say, 
Susie is a very bright child, but she probably is not quite as bright as 
indicated by the score. 

People who do not understand the bias inherent in extreme scores, 
or fail to take the bias into account, make some very poor interpreta- 
tions of test results. For example, suppose that Mr. Martin performs an 
experiment with the students who score very low on an achievement 
test. Suppose that he takes all the students who score at or below the 
20th percentile, gives them a month of special instruction, and then 
administers an alternate form of the achievement test. (Such alternate 
forms are available for many commercially distributed tests.) Whereas 
on the first test the average percentile was 10, on the second test the 
average percentile is 20. Mr. Martin takes the result as evidence that 
his special instruction was highly effective. Because of the bias inherent 
in extreme scores, this is a completely erroneous conclusion. Owing to
the tendency of extreme scores to regress toward the mean on another
occasion, the special group of students would probably have gone from 
an average 10th percentile to an average 20th percentile purely because 
of measurement error, without the opportunity for special instruction. 
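
Mr. Martin's error can be demonstrated with a simulation in which no instruction of any kind is given between the two forms. The Python sketch below is hypothetical throughout: the number of students, the spread of true ability, and the size of the chance factors are invented, and scores are expressed in raw points, not percentiles.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

N_STUDENTS = 5000
TRUE_SD = 10.0   # hypothetical spread of true ability
ERROR_SD = 7.0   # hypothetical spread of chance factors on any one form

true_scores = [rng.gauss(50, TRUE_SD) for _ in range(N_STUDENTS)]
first_form = [t + rng.gauss(0, ERROR_SD) for t in true_scores]
# Alternate form: same true scores, fresh luck, and no special instruction.
second_form = [t + rng.gauss(0, ERROR_SD) for t in true_scores]

# Select the students at or below the 20th percentile on the first form.
cutoff = sorted(first_form)[int(0.20 * N_STUDENTS) - 1]
low_group = [i for i, score in enumerate(first_form) if score <= cutoff]

mean_first = sum(first_form[i] for i in low_group) / len(low_group)
mean_second = sum(second_form[i] for i in low_group) / len(low_group)
mean_true = sum(true_scores[i] for i in low_group) / len(low_group)

print(f"low group, first form:  {mean_first:.1f}")
print(f"low group, second form: {mean_second:.1f}")
print(f"low group, true scores: {mean_true:.1f}")
```

The low group's average rises on the second form even though nothing was done for these students, and it rises to about the level of their average true score. Any gain credited to special instruction would have to exceed this expected rise.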

As another example of how misinterpretations can be made by fail- 
ing to take measurement-error bias into account, suppose all students 
entering the fourth grade are given a comprehensive reading test at 
the beginning of the school year and an alternate form of the test at 
the end of the year. In comparing scores on the two occasions, the 
teacher might be surprised to find that two-thirds of the students who 
made very high scores on the first test tend to make lower scores on 
the second test, and two-thirds of the students who made very low 
scores on the first test tend to make higher scores on the second test. It 
is tempting for the teacher to relate this finding to the educational 
process and to the particular fourth-grade curriculum, but it would be 
completely erroneous to do so. One such erroneous conclusion would 
be that brighter children tend to get duller, and duller children tend to 
get brighter as they both grow older. However, such regression toward 
the mean is completely to be expected, regardless of what happens in 
the interval between the two tests. It is caused by the chance factors 
in test scores and not by the intervening instruction. 

Teachers may be somewhat bewildered by the seemingly complex 
statistics which are used to assess the amount of bias and the relative 
error which is inherent in all test scores. (Those who are interested 
in some of the related statistics will find them discussed in Appendix 
C-4.) However, the statistical arguments are not the important points 
for the teacher to understand. First, it is important to realize that 
extreme scores tend to be biased and to remember this when interpret- 
ing particular test results. Secondly, even after the bias is taken into 
account, it must be remembered that there is a band of error surround- 
ing all test scores, and the width of the band depends on the reliability 
of the test. Many of the manuals for commercially distributed tests 
provide excellent discussions of the amount of error to be expected.
They supply useful rules of thumb for gauging the zones in which it
is safe to interpret test results.

Teachers are not likely to perform complex studies of reliability or
make elaborate statistical estimates of the effect of measurement error
on classroom examinations. However, they can remember what
measurement error tends to do to test scores and try to take this into
account in looking at particular test results.

Effect on Validity. In the previous section we discussed the effect
of measurement error, or unreliability, on test scores. In this section
we will discuss the effect of measurement error on predictive validity.
In Chapter 2 it was said that when a test is intended to serve a prediction
function, validity is determined by the correlation between the test
and its intended criterion. Examples of tests used in prediction are (a) 
tests of general intelligence used to decide on the admission of a
5½-year-old child to the first grade, (b) tests of reading readiness to help
plan the instruction of students in primary grades, (c) comprehensive 
achievement tests to help in assigning students to special courses of 
study in high school, and (d) scholastic aptitude tests to help in 
counseling students about college training. To the extent to which 
such predictor tests have large amounts of measurement error, they 
provide poor forecasts of how students actually will perform in particu- 
lar courses of study. For these reasons it is important for teachers to 
understand the effect of measurement error on predictive validity. 

Measurement error, or unreliability, always works to obscure or, as 
we say, attenuate any type of scientific lawfulness. Whatever “real”
lawfulness there is in nature will appear blurred if relatively unreliable 
measures are used to chart that lawfulness. When dealing with predic- 
tor tests, this means that, to the extent to which the test has much 
measurement error, it cannot do a good job of predicting a criterion. 
Measurement error tends to attenuate correlations; i.e., it makes them 
closer to zero. An example may help to show how this works.

Figure 4-4. Relationship between a predictor test and its criterion before measurement error is added. (Horizontal axis: predictor test score.)

Figure 4-4 shows a hypothetical relationship between a predictor test
and its criterion. As would be the case only in hypothetical circumstances,
the test predicts its criterion perfectly, and all the scores lie
on a straight line. Let us see what happens when some measurement
error is introduced into the test scores. What we will do is to flip a
coin for each score in turn. If it turns up heads, we will add three
points to the score; and if it is tails, the score will be left as it is.
Flipping the coin for each of the test scores in turn, we find the following:


Original score    Coin flip    New score
       0              H             3
       1              T             1
       2              H             5
       3              H             6
       4              H             7
       5              T             5
       6              T             6
       7              H            10
       8              H            11
       9              H            12
      10              T            10
      11              T            11
      12              H            15
      13              T            13


Figure 4-5. Relationship between a predictor test and its criterion after measurement error is added. (Horizontal axis: predictor test score; vertical axis: criterion score, school grades.)

In Figure 4-5 the new scores, with the included error component, are
plotted against the criterion. The criterion scores have, of course, not
been changed by the random additions to the predictor test scores. In
Figure 4-5 it can be seen that there is no longer a perfect correlation.
The effect of adding error to the scores is to lower the correlation.
Starting off with a perfect relationship, there is no way for the relationship
to change except to a lower correlation. The important point
is that the addition of random error will tend to lower the correlation
no matter what it is originally. If we had pictured a scatter plot with a
correlation of .50, the random addition of score points would have
tended to make the correlation nearer zero. Likewise, if we had a
correlation of -.50, the random addition of score points would have
tended to make the correlation nearer zero.
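The coin-flip table can be checked with a short computation. This is a minimal sketch, assuming for illustration that the criterion is an exact straight-line function of the original predictor score (here, three times the score). That slope is hypothetical and is not read from the figure; any straight line gives the same correlations. `pearson_r` is a helper written for this example.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


original = list(range(14))  # scores 0 through 13
new = [3, 1, 5, 6, 7, 5, 6, 10, 11, 12, 10, 11, 15, 13]  # after the coin flips
criterion = [3 * score for score in original]  # hypothetical perfect criterion

r_before = pearson_r(original, criterion)
r_after = pearson_r(new, criterion)
print(f"before error is added: {r_before:.2f}")  # perfect relationship
print(f"after error is added:  {r_after:.2f}")   # about .93
```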

With as few scores as those shown in the example above, it is possi- 
ble that the random changes in scores would either leave a correlation 
unchanged or could, in a very rare circumstance, make the correlation 
higher. However, the odds are against anything but a lowering of the 
correlation, and if the number of students is as large as one hundred, 
it can be predicted not only that the correlation will be lower but with 
fair accuracy just how much it will be lower. 
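A simulation shows how the odds change with the number of students. The sketch below assumes a true correlation of .50 between predictor and criterion and adds random error with the same spread as the predictor itself; these values, the number of trials, and the function names are hypothetical choices made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fraction_lowered(n_students, true_r=0.5, error_sd=1.0, trials=500, seed=7):
    """Share of simulated classes in which added error moves r nearer zero."""
    rng = random.Random(seed)
    lowered = 0
    for _ in range(trials):
        predictor, criterion = [], []
        for _ in range(n_students):
            p = rng.gauss(0, 1)
            c = true_r * p + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
            predictor.append(p)
            criterion.append(c)
        # Add pure chance to the predictor only; the criterion is untouched.
        noisy = [p + rng.gauss(0, error_sd) for p in predictor]
        if abs(pearson_r(noisy, criterion)) < abs(pearson_r(predictor, criterion)):
            lowered += 1
    return lowered / trials


print("14 students: ", fraction_lowered(14))
print("100 students:", fraction_lowered(100))
```

With only a handful of students an occasional class shows no drop or even a rise, but with one hundred students the correlation is lowered in nearly every simulated class.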

Measurement error in educational tests works very much the same 
as the chance component introduced by the coin flips. Because un- 
reliability tends to lower all correlations, it will lower the correlation of 
a predictor test with its criterion and thus lower the validity of the test. 

The relationship between test reliability and test validity should 
be carefully considered, but reliability does not necessarily ensure high 
validity. Reliability concerns the consistency of what is measured, 
regardless of whether what is measured is good or valid in any sense. 
For example, we might use the weights of children to predict school 
grades. Whereas weight might be determined very precisely and be
quite accurately repeatable, weights of children would serve as a very 
poor predictor of school grades and thus not be valid. For another 
example, grades in ancient history could be determined by measuring 
how far students could throw their textbooks. Such tosses would 
probably prove to be highly reliable. That is, if we asked them to toss 
their textbooks on another occasion, or to toss another textbook, the 
two “tests” would probably correlate highly and thus be highly reliable. 
It is not necessary to belabor the point that tossing textbooks would 
not provide a valid assessment of ancient history. 

Even if reliability does not ensure validity, reliability does place a 
limit on the extent to which a test is valid for any purpose. In order to 
have high validity, it is absolutely necessary to have relatively high 
reliability. High reliability is a necessary, but not sufficient, condition 
for high validity. If the reliability is zero, or not much above zero, then 
the test is invalid, regardless of the type of validity intended (those 
discussed in Chapter 2). Consequently, as a prelude to determining 
the validity of tests (achievement tests, aptitude tests, measures of 
psychological traits), it is important to first determine the reliability. 
If the reliability is low, it must be raised (by ways to be discussed in a 
following section) before the test can possibly achieve the desired 
validity. 




Statistical estimates can be made of the effect of unreliability on 
predictive validity. For example, suppose that you have a predictor 
test which has a reliability of .70 and in its present form correlates .30 
with its criterion. When the reliability is increased, the predictive 
validity will increase, which means that the correlation will be higher 
than .30. Formulas are available for estimating how much predictive 
validity will be increased by an increase in reliability. These are dis- 
cussed in Appendix C-1. With the previous example, if the reliability 
is increased from .70 to .90, the estimate is that the predictive validity 
will increase from .30 to .34. Although it is easiest to demonstrate the 
effect of unreliability on the validity of predictor tests, this is not the 
type of test with which teachers are most concerned. They are more 
concerned with the results of achievement tests and day-to-day class- 
room examinations. Even though it is not as easy to demonstrate, 
reliability has the same effect on instruments of this kind. If two 
alternate forms of a classroom examination or two alternate forms of 
an achievement test do not correlate highly, and thus are said to be 
relatively unreliable, they cannot be highly valid as assessments. If the 
reliability of assessments is not greater than, say, .60, then they cannot
be valid measures of school performance. Only if the reliability is above
.80, and preferably above .90, is it possible for tests to be highly valid
assessments of performance. When teachers fail to heed some of the
cautions which will be mentioned later in this chapter, and in other
chapters, their tests often are far less reliable than they should be, and
often so unreliable as to be nearly worthless assessments of performance.
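The estimate quoted above for the predictor test (reliability raised from .70 to .90, validity rising from .30 to .34) can be reproduced with a short calculation. The sketch assumes the usual attenuation relation, in which validity changes in proportion to the square root of the ratio of the new reliability to the old. The text defers its formulas to the appendix, so treating this as the formula intended is an assumption, although it does give the stated figure; the function name is hypothetical.

```python
import math


def estimated_validity(old_validity, old_reliability, new_reliability):
    """Estimate validity after a change in the predictor's reliability.

    Assumes validity is proportional to the square root of reliability.
    """
    return old_validity * math.sqrt(new_reliability / old_reliability)


new_validity = estimated_validity(0.30, 0.70, 0.90)
print(f"estimated validity: {new_validity:.2f}")  # .34
```

Even a perfectly reliable test would, under the same assumption, reach a validity of only about .36 in this example, which shows that raising reliability helps but cannot by itself make a weak predictor a strong one.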


Sources of Unreliability 


It was stated that any random influence on test scores would cause
unreliability. There are many ways in which this can happen in prac- 
tice. Some of the most prominent sources of unreliability are discussed 
in the following sections. 

Errors Due to Day-to-Day Fluctuations. The largest source of
measurement error in most tests is due to day-to-day fluctuations in the
individual which lower or raise scores. Changes in mood, physical well-being,
happenings in the home, and many other events contribute to
inconsistencies in test scores from one occasion to another. The abilities
and personality characteristics of children change somewhat from day
to day. Added to the rather enduring pattern of abilities and personal
features which characterize the child, he has his definite ups and
downs. This is seen most markedly when the child is ill or has had a
very bad time with Mom and Dad before coming to school. On the
positive side, a child will be on top of the world one day, very alert,
amiable, and mentally sharp. On the day on which a test is given, the
child who is “up” is lucky, and the child who is “down” is unlucky, and
these relative amounts of luck are reflected in test scores. Consequently,
the test given on one occasion is not totally reliable. It is influenced by
the happenstance of events occurring during the previous twenty-four
hours, or previous several days.

The extent to which such day-to-day fluctuations in the child result 
in unreliability of test scores depends on the type of test. In some types 
of personality tests, the influence can be relatively strong. For example, 
if the test contains items like “Do most people like you?” and “Are you 
as smart as most of your friends?” the child in a bad mood on a par- 
ticular day is likely to make a rather different score from what he 
would make on another day in another mood. In contrast, most tests 
of ability, such as intelligence tests, achievement tests, and classroom 
examinations, are influenced to some extent by day-to-day fluctuations 
in the child but not nearly to the same extent as some of the per- 
sonality tests. 

Errors Due to the Sampling of Content. In Chapter 2 it was said 
that assessments depend on a sampling of the important content in a 
course or larger unit of instruction. For example, in a final examination,
a teacher tries to pose questions which range broadly across important 
material included in the lectures, the text, the supplementary reading, 
and other sources. Even so, there is an element of chance involved in 
what questions are included in the test. Every student has had the 
feeling that he was lucky on a particular test, that of the many topics in 
the course, the instructor happened to ask about things which he knew. 
Another student, with the same level of understanding about the whole 
course material, may have been unlucky, in that he happened not to 
know about the particular questions. If the teacher were to start over
and compose another examination, the luck of these students might be 
reversed, and their scores might be somewhat different. 

As in all sampling problems, the more questions there are, the less 
unreliability will come from the sampling of questions. If the test con- 
tains only 10 questions, luck will probably have a greater influence on 
test scores than if 100 questions are used instead. Stretching this point 
of view to the extreme, as larger and larger numbers of questions are 
included, the instructor would approach including everything im- 
portant, and it no longer would be only a sample. Although there is a 
practical limit to the length of tests, generally speaking, longer tests 
are more reliable than shorter tests. For this and for other reasons, it 
is often said that a long test is a good test. Even though one can easily 
think of exceptions to this rule, e.g., 1,000 questions concerning base- 
ball as an intended measure of ancient history, the saying has a great 
deal of merit. 
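The effect of test length on luck can be put in rough numbers. The sketch below assumes a simple model that is not stated in the text: a student knows a fixed proportion of everything that could be asked, and each question is drawn at random from that material. Under this assumption the chance variation in the percentage score follows the binomial standard error. The 70 percent figure and the function name are hypothetical.

```python
import math


def sampling_error_of_percent(proportion_known, n_questions):
    """Standard error of the proportion-correct score due to question sampling."""
    p = proportion_known
    return math.sqrt(p * (1 - p) / n_questions)


# A hypothetical student who knows 70 percent of the course material.
for n_questions in (10, 25, 100):
    se = sampling_error_of_percent(0.70, n_questions)
    print(f"{n_questions:3d} questions: about {100 * se:.1f} percentage points of luck")
```

Under these assumptions a tenfold increase in length cuts the chance variation to roughly a third, so lengthening a test helps, but with diminishing returns.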




Errors in Scoring Tests. On multiple-choice tests, the errors in 
scoring are purely mechanical. If the test is scored by hand, it is possible
to accidentally score some correct answers as incorrect and vice
versa. Also, errors can be made in counting up the number of correct
and incorrect answers for each student. If tests are machine-scored,
an improperly functioning machine can add a considerable amount of
measurement error to test scores.

The errors in scoring an essay examination are more subtle in charac- 
ter, but they also generate unreliability. Whereas errors in scoring are 
relatively minor in multiple-choice tests, they are very prominent in the 
subjective grading of essay examinations. If three different persons 
teach the fourth grade in a particular school, they will probably have
at least slightly different ideas about what kinds of answers merit good 
and poor marks. Consequently, the grade that a student gets is partly 
dependent on the happenstance of which teacher he has. 

In addition to differences in grading standards, there is an element 
of measurement error built into the individual teacher. If a teacher 
regrades a set of essay examinations after a period of time and there 
are no identifying marks to indicate what the earlier grades were, the 
grades will be somewhat different the second time. Even within one 
grading session, the teacher often changes his standards as he marks 
the tests. He might have a strict standard at first, but as he goes through
the papers and sees that none of them measure up, he is likely to
become more lenient. Consequently, the grade that the student obtains
is partly dependent on the chance appearance of his test in the order
in which they are graded. Some ways of improving the reliability of 
essay examinations will be discussed in Chapter 7.

Errors Due to Guessing. Errors due to guessing occur in examinations
where the student is asked to identify the correct answer from
two or more choices. For example, in a true-false examination, a student
who knew nothing at all about the subject matter would be able to
guess approximately half of the answers correctly, even if he went to the
extreme of flipping a coin. If, instead of using true-false alternatives, a student
is asked to choose the correct alternative among four, guessing is not
as prominent, but it still plays a part. What students typically do is to
go through an examination marking the answers of which they are
relatively sure, and then go back to those about which they are uncertain.
If (following a rule which will be recommended in Chapter 7)
all students attempt all items, some students are bound to make wild
guesses on at least some of the questions. Even so, there is some
probability that the student will choose the correct alternative. Another
student who may be equally well or poorly informed about the subject
matter may choose the wrong alternative purely by chance. Therefore,
the guessing that occurs on multiple-choice tests acts as another form
of measurement error. If students were asked to take the same test
over and their memory of the previous testing were erased, they would
likely guess differently, and their scores on the two occasions would
not be exactly the same.

A free-response (fill-in) test, one in which the student supplies the 
correct answer, is theoretically more reliable than a multiple-choice 
test. However, there are few subject matters which can be cast in the 
free-response form, the requirement being that there be only one word 
or term which represents the correct answer. Consequently, the fact
that some measurement error is due to guessing on identification-type 
questions is not a sufficient reason for doing away with true-false and 
multiple-choice forms. The important consideration is that the measure- 
ment error due to guessing is less influential as the number of alterna- 
tive answers for each question is increased. On a true-false test, purely 
by chance the odds are fifty-fifty of getting the correct answer. If ten 
alternative answers are used instead, the odds are only one out of ten 
of getting the correct answer, and chance has less influence on the test 
results. If the subject matter permits, a test will be more reliable if 
four or five alternatives are used for each item instead of only two or 
three.
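The influence of the number of alternatives can be worked out for the extreme case of a student who guesses blindly on every item. The following sketch assumes that every alternative is equally likely to be chosen, so the number of lucky answers follows the binomial distribution; the 100-item test and the function name are hypothetical.

```python
import math


def blind_guess_stats(n_items, n_alternatives):
    """Expected score, and its chance spread, for pure guessing."""
    p = 1.0 / n_alternatives                    # chance of a lucky answer
    expected = n_items * p                      # average number correct
    spread = math.sqrt(n_items * p * (1 - p))   # standard deviation of that number
    return expected, spread


for n_alternatives in (2, 4, 5, 10):
    expected, spread = blind_guess_stats(100, n_alternatives)
    print(f"{n_alternatives:2d} alternatives: expect {expected:.0f} correct, "
          f"give or take {spread:.1f}")
```

Both the score earned by chance and its item-to-item unpredictability shrink as alternatives are added, although the gain from each additional alternative becomes smaller.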

Because of the difficulties in composing tests and the limited amount 
of time available for testing, some compromise must be reached 
between the need to have many items and many alternative answers 
for each item. That is why standardized tests have sought a com- 
promise solution by using about four or five alternatives, with as 
many items as time will permit. 

Errors Due to Test Standardization. Standardization is the essence 
of testing. A test is said to be standardized when the rules for taking 
the test are made sufficiently clear that all children are given essentially 
the same task. When a test is made easier for some children, and more 
difficult for others, or when the rules for taking the test are not made 
clear, some measurement error is introduced into test scores. To ensure 
that tests are well standardized, test manuals for most commercially 
distributed instruments spell out the rules in some detail. If it is to be 
a timed test, the exact amount of time is stated, and this should be 
followed religiously. When one teacher allows his students a few extra 
minutes, it makes the over-all use of the test somewhat unreliable. 

Before the test begins, students should be told whether or not they 
should attempt all items, even those about which they are unsure, and 
if, after answering the last question, they are permitted to go back and 
check answers given earlier. If special materials are required in taking 
the test, such as special pencils, rulers, or other equipment, these should 
be placed on desks before the testing begins. If students are allowed to 
ask questions during the test, it must be decided what kinds of ques- 
tions will be answered and how students are to submit their questions. 




In addition to being a collection of questions, each test embodies a 
set of rules and procedures which must be carefully followed. This is 
so not only for well-constructed, commercially distributed tests, but 
also for classroom examinations. The rules should be thought out in 
advance by the teacher and applied uniformly to all students. In par- 
ticular, it is important to write out the test instructions in advance, 
which are then either read by students or given orally by the teacher. 
These should make all the testing procedures and rules absolutely 
clear. Some suggestions for writing test instructions will be given in
Chapter 5. If the rules and procedures are not made absolutely clear,
or are not uniformly enforced by the teacher, some students will be
given an advantage, and others will be given a disadvantage. This
inevitably results in measurement error.

Errors Due to Long-range Instability. Previously it was said that
day-to-day fluctuations in the individual's performance constitute a
primary source of unreliability. Another consideration is whether scores
remain stable over long periods of time. Would an eight-year-old child
demonstrate the same IQ that he obtained when he was six? Would
adults show the same scores on interest tests that they obtained when
they were in their early teens?

Scores on some types of tests tend to remain fairly stable over periods
of one or more years. Scores on other types of tests tend to change
markedly during such periods of time. Whether or not long-range
stability is expected depends on the type of test and the way in which
tests are used. Scores on intelligence tests are expected to remain stable
over relatively long periods of time. If a child’s IQ goes up or down
by as much as 15 points over a period of two years, it makes the test
very difficult to use. On the other hand, there are measures which are
expected to change in relatively short periods of time. If an experiment
is being performed to alter the attitudes of a group of students toward
some political issue, it is expected that a test of attitudes given after
the experiment will show differences from a test given before the experiment.
Changes are expected on tests measuring interests in recreational
activities. Nine-year-old children play at different games than
younger children do, and twelve-year-olds have even different recreational
interests. Consequently, no one expects the interest pattern shown at
one age to remain stable over periods of years.

There are two important considerations regarding the long-range
stability of scores. First, the investigation of long-range stability is an
important topic in its own right. For example, studies have
been made of the growth and decline of intelligence, and long-range
investigations have been made of the stability of interests over periods
of ten and twenty years. Some of these developmental studies will be
mentioned in later chapters. The second important point is that, if
scores are actually used over long periods of time to make decisions
about students, then any instability over those periods of time acts 
like the other sources of measurement error to make the test results 
unreliable. For example, if an intelligence test is given to children in 
the second grade and the results are used to make decisions about 
students three and four years later, it is essential that the test results 
be stable over that period of time. 

Suppose that in the second grade Susie manifests an IQ of 125. Three 
years later in the fifth grade, the earlier test result is used as one of the 
bases for assigning Susie to a special section. But instead of using the 
earlier test result, suppose that Susie were retested, and this time she 
shows an IQ of only 110. It should be obvious that the more recent test 
result would be more representative of Susie’s actual abilities at the 
time, and to have used the test result obtained earlier would have been 
incorrect. 

If scores are used over a period of time as an indication of an indi- 
vidual’s standing and the scores are found to be unstable over that 
period of time, the instability usually generates unreliability in the 
same sense in which the other sources of error contribute to unrelia- 
bility. The important consideration is that it is generally unsafe to use 
test scores over long periods of time, say over more than one year’s 
time. It is far wiser to retest periodically, although such constant testing 
can be a real drain on school resources. Wherever possible, it is wisest 
to administer some of the major instruments at least every other year. 
Otherwise individuals are likely to change considerably, and test results 
will be outmoded and inaccurate. Theoretically, the best time to
administer tests is immediately before decisions are to be made. 
Although this ideal seldom can be carried out in practice, it is wise to 
remember that test results usually become antiquated after a period 
of several years. 


Estimating the Reliability 


In the previous section we discussed a number of kinds of “gremlins” 
that cause test scores to be unreliable. The effect is much like that of 
tossing coins or dice to introduce an element of pure chance into test 
results. Such measurement error is bad because it introduces a zone of 
uncertainty about particular scores, and it obscures the relationship 
between any test and its criterion. Consequently, one of the very
important steps in constructing a commercially distributed test is to 
study the reliability. Such studies consist essentially in learning the 
consistency with which scores are maintained on alternate forms of the 
test over various intervals of time. The extent to which two such sets 
of scores correlate is called the reliability coefficient. Previously, it was 
said that for commercially distributed tests it is expected that the
reliability coefficient will be at least as high as .80, and preferably .90
or higher.

Theoretically, one could argue that there are many different relia- 
bility coefficients for each test, corresponding to measures of the various 
sources of error discussed previously. However, in practice the usual 
effort is to obtain the reliability coefficient in such a way that it meas- 
ures as many of the potential sources of error as possible. This will 
show the maximum influence that chance factors could have, and if, in 
spite of these, a test still produces consistent results, we can feel relatively
sure that measurement error has only a slight influence. A relatively
ideal way to measure reliability will be discussed first, followed by
some useful approximations.


Alternate-form Reliability. The most comprehensive measure of
reliability is obtained by correlating the scores which students make on
two alternate forms of the same test. For many of the commercially
distributed tests there are two or more forms which are intended to
measure the same thing. For example, to measure spelling achievement,
the test publisher will construct two tests, each containing fifty
words. These might then be referred to as forms A and B of the fourth-grade
spelling achievement test. Care is taken to make sure that both
lists contain a wide variety of words and that both lists are at approximately
the same level of difficulty. Test experts employ some specialized
statistical procedures to select items for alternate forms which
help ensure that they measure much the same thing. Such alternate
forms are useful, not only to measure reliability, but also to help solve
numerous practical problems. For example, if form A is given at the
beginning of the year, form B can be given at the end of the year to
measure the over-all improvement of the group. Alternate forms are
also useful in case something goes wrong in the first testing and
another form must be used.

The correlation between two such alternate forms offers an excellent
measure of reliability. Preferably, the two forms should be administered
to the same students with at least a two-week interval. This
will allow for the day-to-day fluctuations in the student which constitute
a primary source of unreliability. If the two forms were administered
on the same day, or only one or two days apart, this would
not allow sufficient time to tap the normal ups and downs in students'
performance.


In addition to serving as a check on the short-range fluctuations in
ability, the alternate-form method does a good job of assessing the
other components of reliability. It provides a good check on errors due
to the sampling of content. When it is possible to draw two samples of
content, such as the two lists of spelling words, and students make
much the same scores on both, it is an indication that both tests involve
relatively little error due to the sampling of content. If the test is not 
well standardized in the sense that the instructions are unclear, or the 
rules of testing are not explicit, this would result in different scores 
on the two tests. If multiple-choice tests are used and guessing intro- 
duces a component of unreliability, this can also be determined by 
correlating the two alternate forms. If the purpose is to study long- 
range stability, a number of alternate forms can be administered over 
a period of years. 

The alternate-form method of measuring reliability is the ideal 
because it measures more of the sources of unreliability and measures
them better than any other method which is used. If it were not for 
practical difficulties, the alternate-form method of measuring reliability 
would be used in most instances. This is the measure of reliability 
which usually is, and should be, reported in the manuals for com- 
mercially distributed tests. 

The alternate-form measure of reliability is not used in some studies 
because of the practical difficulties in constructing two forms. It is 
often difficult to make up one good test, and it is almost twice as dif- 
ficult to make up two. Even if alternate forms are available, it is some- 
times excessively time consuming, or very difficult, to get people back 
for a second testing. This usually is not the case in schools, where 
students are more readily available for a second testing. But the dif- 
ficulty does often occur in industry and in military settings, where 
men often only are available for testing on one or two days and then 
must leave for other jobs and assignments. Because of practical dif- 
ficulties, the following approximate measures of reliability are often 
employed.

Retest Method. Rather than administer two alternate forms of a 
test, the same test can be given on two occasions. The correlation 
between two such repeated administrations of a test is called the retest 
measure of reliability. 

There are two important disadvantages in using the retest method. 
The first is that the obtained reliability coefficient will reflect none of 
the error due to the sampling of content. This is because the content 
on both occasions is the same. The second disadvantage is that the 
individual's memory of his answers at the first test administration is 
quite likely to influence the answers which he gives on the second test 
administration. He is even likely to make the same kinds of guesses 
and tend to perpetuate the same relative amount of good or bad luck 
that he had on the first test administration. Memory works to make the 
two sets of test scores correlate highly; consequently, the reliability 
coefficient is usually an overestimate when determined by the retest 
method. For example, if the retest measure of reliability is found to 
be .90, it would ordinarily be the case that an alternate-form measure
would be less, say .85 or .80. Memory is less influential as the interval
between testings is increased. But during the two-week to one-month
period in which it is advisable to complete both testings, memory is
likely to be a strong factor; thus, the retest method will often provide a
substantial overestimate of what would be obtained from the alternate-
form method.
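The inflating effect of memory can be illustrated with a simulation. The sketch below rests on assumptions that are not in the text: each score is a stable true level plus a chance component, an alternate form draws an entirely fresh chance component, and a retest carries over part of the first occasion's chance component (the `memory` weight). All numerical values and names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def compare_methods(n_students=2000, error_sd=0.6, memory=0.5, seed=3):
    """Return (alternate-form r, retest r) for one simulated group."""
    rng = random.Random(seed)
    first, alternate, retest = [], [], []
    for _ in range(n_students):
        true_level = rng.gauss(0, 1)
        luck_first = rng.gauss(0, error_sd)
        first.append(true_level + luck_first)
        # Alternate form: new questions, so the luck is entirely new.
        alternate.append(true_level + rng.gauss(0, error_sd))
        # Retest: remembered answers repeat part of the earlier luck.
        luck_retest = (memory * luck_first
                       + math.sqrt(1 - memory ** 2) * rng.gauss(0, error_sd))
        retest.append(true_level + luck_retest)
    return pearson_r(first, alternate), pearson_r(first, retest)


alternate_r, retest_r = compare_methods()
print(f"alternate-form estimate: {alternate_r:.2f}")
print(f"retest estimate:         {retest_r:.2f}")
```

Setting `memory` to zero makes the two estimates agree apart from sampling fluctuation, which corresponds to the remark that memory matters less as the interval between testings grows.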

The primary reason why the retest method is used is to save the 
labors of making up an alternate form. The most proper use of the 
retest method is in studying long-range stability, where, for example, a 
period of at least a year intervenes between testings. 

Subdivided Test Method. Instead of making up two alternate 
forms, a compromise procedure is to obtain part scores for different 
sections within the same test when it is administered on one occasion 
only. The most popular of such procedures, referred to as the split-
half method, is to give the student one score on all the even-numbered 
questions in the examination and another score on the odd-numbered 
questions. The two halves of the same test can then be thought of as 
approximations to alternate forms. Scores from the two halves can be 
correlated to obtain an estimate of the reliability, after which a statisti- 
cal correction must be made to estimate the reliability of the whole 
test, not just of the half tests. (The correction is presented and dis-
cussed in Appendix C-7.)
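
As an illustration of the split-half procedure just described, the following Python sketch scores the odd- and even-numbered items separately, correlates the two half scores, and then steps the result up to full test length. The response data are invented for the example. The step-up used here is the Spearman-Brown formula, r_full = 2r / (1 + r), which is the correction commonly applied for this purpose; that the appendix presents the correction in exactly this form is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(item_scores):
    """item_scores: one list of 0/1 item scores per student (hypothetical data)."""
    odd = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    # Spearman-Brown step-up to full length (assumed form of the correction).
    r_full = 2 * r_half / (1 + r_half)
    return r_half, r_full

# Invented responses: 6 students, 8 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(responses)
print("half-test correlation:", round(r_half, 2))
print("estimated whole-test reliability:", round(r_full, 2))
```

Whenever the half-test correlation is positive and below 1, the stepped-up value is larger than the half-test correlation, which is why the uncorrected figure understates the reliability of the whole test.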


Practical considerations have caused the subdivided test methods to
be used as extensively as they have. In many situations it is either too
expensive to compose alternate forms or it is not possible to obtain the
same students for a second test administration. However, there are
some definite cautions that should be heeded in the use of the subdivided
test method. Although the method determines some of the
error due to the sampling of content, it does not do this as well as the
alternate-form method. Because both halves of the test are given at the
same time, the subdivided test method shows none of the error due to
instability over time. For these reasons, the subdivided test methods
usually give an overestimate of the reliability.

It is particularly misleading to use the subdivided test methods on
a highly speeded test. Such a test would be, for example, one in which
students are asked to complete as many simple arithmetic problems as
possible in a short period of time. If the problems are very simple, it
is a test of how fast the students work. Each student is likely to get
most of the problems correct as far as he goes, and students will differ
mainly in terms of how far along they are in the problems when time
is called. This works to make the students’ scores very much alike on
the split halves. The odd and even scores will be alike on most of the
items up to the point at which time is called, because most of them
will be correct. Also, the odd and even items will be alike beyond that
point because they either would all be incorrect or not answered. The
split-half reliability estimate obtained on a purely speeded test, such 
as the one illustrated above, would be completely meaningless. For- 
tunately, there are very few tests which are purely concerned with 
speed. Most tests have some time limit primarily as a practical means 
of getting the test completed, but time, as such, often has only a 
trivial influence on scores. However, we should always be wary of 
split-half reliability estimates based on highly speeded tests. 
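
The argument about speeded tests can be checked with a small numerical sketch. It assumes a hypothetical pure speed test in which every item reached is answered correctly, so that students differ only in how far they get; the numbers of items reached are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: number of items each student reached before time was called.
items_reached = [10, 14, 18, 22, 25, 30, 33, 37]

# With every attempted item correct, the half scores depend only on how far
# the student got: odd-numbered items among the first n, and even-numbered ones.
odd_scores = [(n + 1) // 2 for n in items_reached]
even_scores = [n // 2 for n in items_reached]

r_half = pearson(odd_scores, even_scores)
print("odd-even correlation on a pure speed test:", round(r_half, 3))
```

The two half scores can never differ by more than one point for any student, so the correlation comes out close to 1.00 regardless of how dependable the scores really are from one occasion to the next.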

There are numerous other variants of the split-half method. For 
example, one improvement on the approach described above is to 
administer the odd and even halves on different occasions, say on 
occasions separated by two weeks or more. When this is done, the 
resulting correlation will reflect the day-to-day fluctuations which 
constitute such a prominent source of unreliability. The reliability coef- 
ficient obtained in this way should closely approximate that from the 
alternate-form method. (The same statistical correction mentioned
above, see Appendix C-7, also would need to be used in this instance.)

Internal-consistency Reliability. As was mentioned in the preceding 
section, there are many different ways to subdivide a test, the most 
popular of which is to use odd and even items. Another way of sub- 
dividing would be to place items one through twenty in one half and 
items twenty-one through forty in another half. And still another 
approach would be to place the first three items in one half, the next 
three items in the other half, and so on. A conceptual problem arises
in that reliability coefficients obtained from these different ways of 
subdividing the test would not all be the same. For example, it is quite 
likely that the correlation between odd and even items would be con- 
siderably higher than the correlation between the first and second 
halves of the test. To circumvent these inconsistent estimates of the 
reliability, statistical measures are available which estimate the relia- 
bilities that would be obtained from all possible ways of subdividing 
the test. These statistics operate on the amount of overlap, or correla- 
tion, among the individual test items. One of the most popular of 
these formulas is discussed in Appendix C-. 
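
To make the idea of an internal-consistency estimate concrete, here is a minimal Python sketch of one widely used statistic of this kind, coefficient alpha, which for items scored right or wrong is equivalent to Kuder-Richardson formula 20. Whether this is the formula treated in the appendix is an assumption, and the response data are invented.

```python
def variance(values):
    """Population variance of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def coefficient_alpha(item_scores):
    """item_scores: one list of item scores per student (hypothetical data)."""
    k = len(item_scores[0])  # number of items
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    # Alpha rises as the items overlap (correlate) more with one another.
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Invented responses: 6 students, 8 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
]

alpha = coefficient_alpha(responses)
print("coefficient alpha:", round(alpha, 2))
```

Because the statistic uses every item at once, it does not depend on any one arbitrary way of dividing the test into halves.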

Teachers will probably find the internal-consistency estimates of 
reliability far too complex to use with classroom tests. But since these 
formulas are often cited in test manuals, it is essential for teachers to 
understand what the formulas estimate. Because these formulas work 
within the items on one test administration, they provide no indication 
of one very important source of unreliability—fluctuation in perform- 
ance from day to day. What these formulas do is to provide a conserva- 
tive estimate of the subdivided test type of reliability. The internal- 
consistency formulas are the preferred procedures when the retest 
method is not advisable and when alternate forms are not available. 




Increasing the Reliability 


Rather than worry about unreliability after it has occurred, it is
important to eliminate as many sources of measurement
error as possible when the test is being constructed. For some of the
sources of error mentioned previously, definite steps can be taken to
lower measurement error. As was mentioned, the measurement error
due to guessing can be lowered by using a larger number of alterna- 
tives for each question. The test can be more highly standardized by 
carefully composing the test instructions, ensuring that all necessary
information for taking the test is supplied to students, and religiously 
following the rules for administration and scoring. Unreliability due to 
the sampling of content can be lowered to some extent by carefully 
spreading the test questions across the important content in the course. 
Errors in scoring on objective tests can be greatly decreased by the 
simple expedient of checking one’s scoring. Errors in scoring essay 
examinations can be greatly decreased by following some of the rules 
given in Chapter 7.

Increasing the Number of Items. After all efforts have been made 
to improve test standardization, the reliability can be raised by making 
the test longer. The increased length will act to reduce the errors due 
to guessing, the errors due to sampling of content, and some of the 
accidental factors in the testing session which tend to increase or 
decrease scores. As tests are made longer and longer, these sources of
measurement error tend to average out and permit a more reliable
result. This is why it was previously said that, other things being equal, 
a long test is a good test. 
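
The effect of lengthening a test can be put in numbers with the general Spearman-Brown formula, r_k = k r / (1 + (k - 1) r), where r is the present reliability and k is the factor by which the test is lengthened. The formula is not given in this section, so its use here is an assumption, as is the starting reliability of .60; the formula also presumes that the added items are comparable to those already in the test.

```python
def lengthened_reliability(r, k):
    """Spearman-Brown estimate of reliability when a test is made k times as long."""
    return k * r / (1 + (k - 1) * r)

current = 0.60  # hypothetical reliability of the present test

for k in (1, 2, 3, 4):
    estimate = lengthened_reliability(current, k)
    print(f"{k} x present length: estimated reliability {estimate:.2f}")
```

Doubling the length raises the hypothetical .60 to .75, and each further lengthening adds less than the one before. The same arithmetic with k = 2 describes averaging the scores from two comparable tests.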

Residual Sources of Error. In spite of the tester’s best efforts, there 
will remain some measurement error, and there will be considerably
more in some types of tests than in others. Nothing can be done to
completely remove the measurement error related to day-to-day and
year-to-year fluctuations in the individual. The individual is simply not
completely the same from day to day, nor from year to year, and no
amount of test construction or statistical manipulation will make him
so.


Summary 

Essentially, the correlation coefficient indicates the extent to which
two instruments measure the same thing. In educational research the
correlation coefficient primarily is useful for (a) determining the
extent to which a predictor test forecasts a criterion and (b) determining
the reliability of tests.

There is some error, or chance, involved in the results from even the
best tests; and in some tests the amount of error is so large as to render
them nearly worthless. Some of the major sources of measurement error 
are (a) fluctuations in the student from day to day, (b) poor sampling 
of content, (c) chance factors that enter into the scoring of tests, (d) 
guessing on multiple-choice tests, and (e) poor standardization of tests. 

Measurement error has several undesirable effects. First, it tends to 
“blur” the relationships between predictor tests and their criteria, 
making correlations lower than if the measurement error were not 
present. Second, it tends to bias scores, high scores being somewhat too 
high and low scores being somewhat too low. Third, it makes it neces- 
sary to consider scores as lying in zones on the test continuum rather 
than representing precise points. 

The ideal way to determine the reliability is to correlate two alter- 
nate forms of the same test. If the correlation is high, it means that 
very little measurement error is present in using the instrument; but 
if the correlation is very low, it means that chance factors are so large 
as to make the instrument untrustworthy. When two alternate forms
are not available, or for some reason cannot be applied, approximate 
methods are available for estimating reliability. Seldom will teachers 
find occasion to make elaborate studies of reliability, but because 
investigations of reliability are reported for most commercially dis- 
tributed tests of achievement, aptitude, personality, and others, it is 
important for teachers to understand the nature of the problem. 

More important than to measure the amount of measurement error 
that is present in tests is to eliminate as many sources of unreliability 
as possible when instruments are being constructed. Definite steps can 
be taken to lessen many sources of unreliability. The ultimate limit on 
the reliability of tests is set by the tendency for people to change in 
their abilities and personality characteristics over time. Two important 
lessons to be learned from a discussion of test reliability are (a) a long 
test almost always is more reliable than a shorter test, and (b) the 
average score obtained on two tests of the same attribute administered 
on different days is more reliable than the result from only one test. 




part II 


Construction and Use 


of Teacher-made Tests 


In terms of sheer numbers, students see many more
teacher-made tests than any other kind. Regardless of
how well commercially distributed tests of achievement
and aptitude are constructed and used in schools,
if teachers did not do a good job of composing and
applying classroom tests, our educational system would
turn to chaos.

Because of the differences in curricula among
schools and school systems, and because of the
differences in emphases among teachers, it usually is
not possible to test accomplishment in individual
units of instruction with commercially distributed
tests. Even if it were possible, it would be prohibitively
expensive for most schools. Consequently,
teachers must construct their own tests; and if grades and the
decisions based upon them are to have any
meaning, the tests must be as well constructed as
possible, commensurate with the time and energy
teachers can spare from their many other duties.

Some concepts, facts, and techniques are essential
to learn before constructing tests. To explain these is
the major purpose of this book. In addition to being a
“learnable” body of knowledge, test construction is, in
part, an art which cannot be acquired solely from
textbooks. Like any other art, it must be learned from
much practice, and particularly from working with
those who have already mastered it. When teachers
first start to construct tests, their products are usually
poor, and they get better (if ever) only after much
experience. To help structure that experience, the
following three chapters will discuss some of the basic
principles in the construction and use of teacher-made tests.


chapter 


Planning the Test 


Max Marshall is a bright and shining new face on the staff of Central 
High. He is young enough to be the son, and maybe even the grand- 
son, of some of the older hands there; and they have given him a 
tough task: teaching five sections of general science. His first month 
at Central has been marked with a number of casualties: sixteen test 
tubes were broken, three students burned their fingers on bunsen 
burners, one girl ruined a dress with a splotch of acid, and, finally, to 
climax it all, the last laboratory demonstration in electrical resistance 
knocked out all the lights on the second floor for a whole afternoon. 
As if these troubles were not enough, Max encounters a new obstacle. 
Looking through the textbook, he sees that the first quarter of the 
subject soon will be covered, and, consequently, it will be a good 
time to examine students on what they have learned. Max has been 
so busy keeping test tubes clean and keeping one chapter ahead of 
his young Einsteins that he had almost forgotten about tests. 

The rest of the afternoon Max worries about how to test his students, 
and he is not as fast on his feet as he usually is in preventing the 
acid from spilling and the sparks from flying. As Max remembers from
his course in educational measurement and as he can see now, he
should have planned his test much earlier in the semester and preferably
even before the semester began.

Good tests do not simply happen as a product of last-minute inspiration.
Good tests are planned in detail well in advance, and the plan
for the test is closely interrelated with the over-all goals of the course,
the classroom instruction, the readings in the text, and the laboratory
exercises.

Numerous questions occur to Max. What type of test item should
I use? How much of the text should the test cover? How long should
the test be? Should the test cover only the text and classroom discussions,
or should it also cover the laboratory work? How many tests
will I have this semester? Will semester grades be based only on the 
tests, or should I count the laboratory exercises and class projects? 
Because Max has not thought out these questions much earlier, it will 
be rather difficult for him to coordinate his tests with the instruction 
and with the other means of evaluating over-all performance. Because 
he did not explain the schedule of testing, and the nature and scope, 
to students at the beginning of the semester, they are confused as to 
how to divide up their time, what to emphasize and what to de- 
emphasize, and what the standards are for successful performance in 
the course. That evening Max fishes out his textbook on educational 
measurement from beneath unpacked fishing gear, trading stamps, and 
past issues of National Geographic. According to the chapter on 
“Planning the Test,” and as Max now remembers, the purpose of a 
test is to supply a number of types of information. 


Information Supplied by Classroom Tests 


The construction, administration, and scoring of tests are some of 
the most difficult and time-consuming chores of teaching. Many
teachers who otherwise do excellent jobs fail miserably when it comes 
to preparing and using tests. In a popularity contest, tests would rank 
somewhere below castor oil in the affection of students. If tests hold 
difficulties for teachers and are unpleasant to students, why then do 
we use them? We use them because we have to. They supply some 
very important information that would be difficult, if not impossible, 
to obtain by any other means. 

Information for the Teacher. Tests primarily are helpful to teachers 
in determining the extent to which students are meeting the objectives 
of a unit of instruction. Such objectives always do (or at least should) 
exist either explicitly stated in a course outline or held in the teacher's 
head. For example, in teaching number skills some of the goals are to 
have students master principles concerning the use of dollar signs, 
carrying in addition, and dealing with remainders in division. Both
teacher and students live by these objectives. Only to the extent that 
the average student meets them can the teacher feel satisfaction with 
the instruction as a whole, and the progress of individual students is 
largely judged by how well they perform with respect to the objec- 
tives. Tests are very helpful because they supply one of the most 
important sources of information as to how well students are meeting
the objectives of a unit of instruction. 

One of the primary reasons for using tests is to let the teacher know 
how each student is doing. Otherwise he cannot make intelligent 
decisions about promoting and failing students, sectioning, providing 
special help to students in difficulty, and producing stimulating material
to students [...] information [...] test to represent a
comprehensive sampling of the important [...] of the course. If the
test is slanted toward some particular aspect of the subject matter to
the detriment of other important areas, then it will serve as a poor
basis for making decisions. If the test concerns trivial aspects of the
subject matter rather than more important ones, decisions based upon
it would be faulty.

In addition to providing information about the progress of particular
students, the test supplies information about how the class as a whole
is progressing. If the average performance is much worse than the
teacher expected, this might indicate either that the teacher had
expected far too much of the students or that the program of instruction
was not working well.

Tests provide the teacher with some clues about what he
teaches, what he emphasizes, and what he values. If the teacher
will look carefully at the content of his tests, he will see what
types of subject matter emphases he considers important. The teacher
may not realize what values he has with respect to various aspects of
the subject matter until he sees what he actually places in his own
tests. Because the emphases in the course should be consciously followed
by the teacher, and because they should be imparted to
students, this information should be available quite early in the term.
That is why we said previously that Max Marshall made a mistake by
waiting until immediately before testing time to give careful consideration
to the construction of a test.

The results from a test provide the teacher with some indications
of what he actually has taught and how well the text and special
exercises have covered the subject matter. If students do much better
on some types of questions in the test than on others, this is an indication
of how well they are being instructed in the various aspects of
the subject matter. For example, if in a test of number skills, students
do relatively much better in multiplication problems than in division
problems, then this gives the teacher some clues about the adequacy of
the over-all instruction.

Information for Students. In addition to the information supplied
to teachers, tests also supply very valuable information to students;
whether or not they profess to like taking tests, the information they
obtain is invaluable. Students often are quite unsure about how
well they are doing in a unit of instruction. Few of the A students
think they are failing, and few of the failing students think that they
are doing A work. But [...] students [...]. Students who receive good
grades on a test will [...] habits [...] to continue [...]
study that brought them success. Students who do poorly on tests are
warned to work harder, work differently, and seek help from others. 
Over a period of years students slowly learn the kinds of things they 
can do well and the kinds of things at which they usually do poorly. 
Such information is invaluable to the student in planning his future. 

Tests also provide students with information about what the teacher 
values and actually intends to stress in the unit of instruction. No 
amount of hallowed words by the teacher will convince students. They 
craftily wait to see what he puts on the examination, and henceforth 
they study the kinds of things which he has emphasized. If a teacher’s 
test in tenth grade algebra usually contains numerous problems on 
linear equations and few, if any, on quadratic equations, students will 
soon catch on and emphasize the former rather than the latter type 
of problem in their study. If in a course in American history the
teacher emphasizes the events relating to the Civil War in his test, 
students will catch on and so concentrate on this in their readings. 
After a teacher has been at a school for some time, older students pass 
along to younger students the lore about what a particular teacher 
emphasizes in his tests. Such information which students receive from 
taking tests can be either helpful or detrimental, depending on the 
quality of the test. As will be discussed in more detail later in the 
chapter, teachers should carefully think out what they value about a 
unit of instruction and what they intend to point up in the instruction. 
If these emphases are made prominent in tests, then this communi- 
cates to the student in a very concrete way the emphases which the 
teacher values. If, on the other hand, the teacher verbalizes one set
of standards for emphasis in the subject matter and includes different 
kinds of items and questions on the test, then the teacher is fooling 
himself, and students are moved to study what often are trivial aspects 
of the course, e.g., the memorization of names, places, and dates in a 
course on American history. 

Information for Parents and Others. Test results also supply very 
valuable information to parents regarding how well their sons and 
daughters are doing in particular grades. Parents often are surprised 
when they find out how well or how poorly their sons or daughters 
are progressing. Students at least have some basis of comparison; they 
are in the classroom, know how often they succeed and fail with 
particular kinds of exercises, and see how well other students do. But 
parents are lacking these guideposts, and they often are very poor 
judges of the progress of their children. Teachers sometimes fail to 
see how difficult it is for parents to judge the progress of their chil- 
dren. The teacher who has taught the fourth grade for three or four 
years has seen many children of approximately the same age and 
witnessed their efforts to learn standard materials. If a child is unable
to read a passage in a standard text or to perform adequately with
certain types of number skills, the expert eye of the teacher soon
detects a difficulty. But teachers often fail to realize that parents
may have only one child, and if more than one, they probably have
only one in the fourth grade. All that the parent can rely on is some
[...] of how he did when he was in the fourth grade or
[...] little girl did when she was in the fourth grade. Even
parents who have several children and watch them progress through
school often forget how well each performed at each level. Consequently,
tests provide very useful information to parents as to how
well their children are doing in school.

Teachers find test results much easier to discuss with parents than
any other type of evaluation. If, in conferring with a parent, the teacher
tells Billy’s mother that he is having great trouble in spelling,
it is easier to communicate this information to the parent if
Billy’s spelling test results are available. If no tests are available,
the parent is likely to be skeptical; she is likely to say to herself,
“Billy spells as well as most other children,” or “I think he spells
well.” But when the parent sees that Billy has consistently scored near
the bottom of his class in spelling, and when she sees that he gets no
more than four or five words correct out of a list of twenty, this usually
is convincing.

Information supplied by classroom tests helps the parent to help
the child. The over-all progress of the child over a period of years
is information that helps the parent in planning for the future. The
child who does only mediocre or poor work throughout the elementary
and high school grades is a very poor bet for college, and
student and parents should be warned about this years in advance.
On the other hand, the student who always does A work probably has
the energy and the capability to go far, and if parents know this in
advance, they can plan and save to help the child go on for more
schooling.

Information from tests helps parents to structure the child’s study
at home. Working on the assumption that the child has been doing
well in school, the parents may have been quite lax in insisting
on and helping with homework. After parents learn that the student
is doing poorly, either over-all or in special areas of instruction, this
provides information as to how the child can be helped at home.

In addition to the information supplied to teachers, students, and
parents, many other persons will obtain information from teacher-made
tests and the grades that depend in part upon them. The permanent
records of students will be inspected and used by many people. For
example, one of the important things for the high school counselor
to consider in providing career guidance to a student would be the
grades made in specific units of instruction (which are based in large 
measure on results from teacher-made tests). If the student’s record 
shows that he has made very good grades in mathematics and science 
topics, and if his grades agree with tests of aptitudes and interests, 
the counselor might inform the student that his abilities and interests 
are such that he might succeed in and enjoy life as a chemist, physicist, 
or engineer. Principals and teachers use student records to make 
decisions about grade placement, remedial instruction, counseling, and 
others. In addition grades made by students in particular units are 
useful in many different types of educational research. 


Outline of Objectives 


To help in the instruction of students and to help in the formula- 
tion of tests, it is good for the teacher to outline the objectives for a 
unit of instruction. In the outline will be listed the major concepts, 
facts, and skills that constitute the teacher's objectives for the learn- 
ing situation as a whole. For example, a portion of an outline for the 
teaching of, and testing of, number skills in the fifth grade is as 
follows: 


A. Addition
   1. Two-, three-, and four-row problems
   2. Problems requiring, and others not requiring, carrying
   3. Decimal points and dollar signs
B. Subtraction
   1. Three-, four-, five-, and six-column problems
   2. Problems requiring borrowing
   3. Decimal points and dollar signs
C. Multiplication
   1. Memorization of multiplication table through 12 times 12
   2. Multipliers of one, two, three, and four digits
   3. Multipliers of 10, 20, 100, 200, etc.
   4. Problems requiring carrying
   5. Decimals and dollar signs
D. Division
   1. Divisors of one, two, and three digits
   2. Quotients without, and others with, remainders
   3. Decimals in divisors and/or dividend
E. Word problems
   1. Requiring only one operation, e.g., division
   2. Requiring two operations in sequence, e.g., first division, then
      subtraction
   3. Requiring answers in fractions



The reason that such an outline should be composed early in the 
term is that it will prove useful not only in generating tests but also 
in guiding the instruction. Inevitably it is the case that, near to the
time when a test is to be administered, the teacher will see that parts
of the outline have not been sufficiently covered, e.g., that few or no
word problems have been studied which required two operations in
sequence. The teacher must then use the remaining days of classwork
to ensure that the outlined subject matter has been adequately cov- 
ered before the test is administered. 

The way to compose a test is to write good items for each section
and each subsection of the outline. How to write good items will be
discussed in detail in the next chapter. Here we will be more concerned
with the outline and with illustrating how outlines can be
translated into tests. Some sample items are as follows:


For Section C, subsection 4: 


   286        9284
  × 48        × 36


For Section D, subsection 1: 


3 | 6921 810 222 9102 


For Section E, subsection 2:
Billy picked 24 apples and gave half of these to his mother. From
his remaining half, he gave 4 apples to a friend. How many apples did
Billy have left for himself?


In the above outline of fifth-grade arithmetic, there is a total of five
sections. In these five sections is a total of seventeen subsections (three
in each, except Multiplication, which has five). If the teacher composed
only one item relating to each subsection, the test would contain
seventeen items. However, many of the subsections relate to two or
more items, e.g., subsection one under Division: “Divisors of one, two,
and three digits.” Consequently, to completely cover all the subsections,
the test would have to contain three or four times as many
items as seventeen, which might be too many items for the children to
take in the time available.
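
The counting in this paragraph can be reproduced by writing the outline down as a small table of specifications. The Python sketch below is only an illustration; the dictionary layout and variable names are assumptions, while the section and subsection titles are taken from the outline of fifth-grade arithmetic.

```python
# The outline of objectives, one entry per section, one string per subsection.
outline = {
    "A. Addition": [
        "Two-, three-, and four-row problems",
        "Problems requiring, and others not requiring, carrying",
        "Decimal points and dollar signs",
    ],
    "B. Subtraction": [
        "Three-, four-, five-, and six-column problems",
        "Problems requiring borrowing",
        "Decimal points and dollar signs",
    ],
    "C. Multiplication": [
        "Memorization of multiplication table through 12 times 12",
        "Multipliers of one, two, three, and four digits",
        "Multipliers of 10, 20, 100, 200, etc.",
        "Problems requiring carrying",
        "Decimals and dollar signs",
    ],
    "D. Division": [
        "Divisors of one, two, and three digits",
        "Quotients without, and others with, remainders",
        "Decimals in divisors and/or dividend",
    ],
    "E. Word problems": [
        "Requiring only one operation",
        "Requiring two operations in sequence",
        "Requiring answers in fractions",
    ],
}

n_sections = len(outline)
n_subsections = sum(len(subsections) for subsections in outline.values())

# "Three or four times as many items as seventeen" for complete coverage.
full_coverage = [multiple * n_subsections for multiple in (3, 4)]

print(n_sections, "sections,", n_subsections, "subsections")
print("items needed for complete coverage:", full_coverage)
```

Complete coverage would therefore call for roughly 51 to 68 items, which is the practical reason the teacher must decide where to place emphasis.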

The emphases placed among the subsections are determined in two
ways. First, the importance which the teacher attaches to the different
sections and subsections will partly determine the emphases and the
number of related items which are composed. In the example above,
the teacher would probably place the most emphasis on word problems
(section E) because, typically, this is one of the areas of concentration
at the fifth-grade level. Consequently, the teacher would
decide to compose nine items for section E, three for each of the 
subsections. In contrast to the emphasis placed on word problems, 
some other sections and subsections may be given relatively slight 
emphasis. If the class as a whole is doing well in arithmetic, it would 
be a waste of time to compose items relating to some of the simpler 
types of problems, e.g., multiplication problems that do not require 
carrying and addition problems with only two rows of numbers. For
these and other relatively unimportant parts of the outline, few or 
no items would be composed. 

In addition to the purposeful placing of emphases, the numbers of
items composed for different parts of the outline are (or rather should
be) allotted on a random basis. The outline of arithmetic above is
relatively simple, and it would be possible to cover all the parts in 
one relatively long test. When more complex outlines are required, 
as would be the case with eighth-grade geography and high school 
physics, it is not possible to adequately cover even the important 
subsections of the outline in the test. Consequently, the teacher must 
compose more items for some parts of the outline than for others. 
This is all right as long as these “random” emphases do not in fact 
slant the test much more toward one part of the subject matter than 
to others, e.g., many items on multiplication and none on division. 
If teachers consistently slant tests toward certain portions of the 
subject matter, students will catch on and correspondingly slant their 
study.

Such random choosing of items is best accomplished if the teacher 
has a large store of items available, obtained from tests given in 
previous years. If these are placed on file cards (which has many 
advantages), all those relating to particular sections of the outline, 
e.g., word problems, can be shuffled and the desired number dealt out. 
This would approach true randomness, which, because the test is an 
assessment, is good. Teachers who are new at instructing a particular 
subject will probably not have a large store of items available. Then 
it is very helpful if the new teachers can borrow stocks of items from 
other teachers who have had considerable experience in dealing with 
the subject matter. 
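The shuffling and dealing of file cards described above has a direct counterpart when the store of items is kept in a computer file. The following is a minimal sketch, assuming a hypothetical item bank keyed by outline section; the function name, the bank contents, and the seed value are all illustrative.

```python
import random


def deal_items(item_bank, items_per_section, seed=None):
    """Shuffle each section's items and deal out the desired number."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    test = {}
    for section, wanted in items_per_section.items():
        cards = list(item_bank[section])  # copy so the bank is left untouched
        rng.shuffle(cards)
        test[section] = cards[:wanted]    # fewer are dealt if the bank is small
    return test


# Hypothetical store of items saved from previous years.
item_bank = {
    "word problems": [f"word problem {i}" for i in range(1, 13)],
    "division": [f"division item {i}" for i in range(1, 9)],
}
test = deal_items(item_bank, {"word problems": 3, "division": 2}, seed=1964)
```

The number dealt from each section is still set by the teacher's judgment of importance; only the choice of particular items within a section is left to chance.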

If teachers do not have their own stock of items and cannot borrow 
them from other teachers, they must try to be as “random” as possible 
in the final selection of items (keeping in mind that the number of 
such items allotted to major sections is primarily determined by the 
judged importance of the sections). There is no way to tell a person 
how to “be random,” but essentially what is meant is to scatter the 
items around throughout all possible problems and questions. After 
tests are composed, it often is easy to see that the items are overly 
slanted toward one or more aspects of the subject matter, e.g., too
many questions about electricity in the science examination and too
few questions about agricultural regions in the geography examination.
One safeguard against overly slanting the test toward particular
aspects of the subject matter is to ask fellow teachers to look through
the intended test and to make suggestions for additions and changes.
It often is easier for our friends to notice such overemphases. Fellow
teachers should also be consulted for many other problems in
composing reliable, comprehensive tests. Admittedly this sometimes is a
painful experience for beginning teachers, but it is necessary for the
growth of professional skills. To further illustrate the development
and use of subject-matter outlines, in the following sections outlines
of objectives will be shown and discussed for eighth-grade geography
and for high school physics.


Outline for Eighth-grade Geography 

Fifth-grade arithmetic was chosen as the topic for the first
illustration because, comparatively speaking, arithmetic is easy to
outline and easy to test. There are some definite divisions of the
subject matter, and it is relatively easy to compose problems relating
to these subdivisions. Here we will consider a subject matter,
geography, which is not quite so easy to outline and which is more
difficult to adequately test. In the example imagine that the teacher
is planning a two-week period of instruction relating to the continent
of Africa, and at the end of that time he intends to compose a test.


A. Principal racial, ethnic, and religious groups
1. Names of groups
2. Principal region each occupies
3. Relative social and economic positions
4. Conflicts and changing relations
B. Major topographic and climatic regions: North Africa, Sahara
region, tropical rain forest, temperate Southern Africa
1. Names of countries in each region
2. Types of rivers, lakes, and rainfall
3. Elevation, topography, types of soil
4. Density and nature of populations
5. Indigenous animals and plants
C. Agriculture
1. Major products of different regions
2. Products that “grow wild”
3. Methods of cultivation and merchandising
4. New products and methods being introduced
D. Commerce and industry
1. Degree of self-sufficiency
2. Transportation
3. Prominent mineral resources
4. Typical industrial products
5. Major exports and imports
E. Political subdivisions
1. Names of major countries
2. Different forms of government
3. Movements toward self-government
4. Political alliances
5. Problems of new nations


For the arithmetic test, it was relatively obvious what types of
problems should be used. For the test of geography, a much wider
selection of topics and item forms is available. Following are some
types of items that could be used to turn the outline into a good test.


For section A, subsection 3: 
One of the most primitive groups of people in Africa is the 
a. South African Boer 
b. Pygmy 
c. Watussi 
d. Egyptian 
e. Berber 
For section D, subsection 3: 
Southern Africa leads the world in the mining of 
a. coal 
b. iron ore 
c. silver 
d. diamonds 
e. bauxite 

For section D, subsection 5: 

Name five principal export commodities from Africa. 

For section B, subsection 2:
The longest river in the world is in Africa. It empties into the
Mediterranean Sea. Its name is the ________.

For section D, subsection 2: 

The camel is used for transportation in certain regions of Africa. State 
three physical features of the camel that make him ideal for transporta- 
tion in desert regions. 

For parts of sections A, B, C, and D: 

Describe the typical village farm in Central Africa. Include in your 
description 

a. the crops often grown 

b. methods of farming 

c. trading of farm products 




d. types of homes and buildings for animals 
e. danger from wild animals 
f. education, work, and play of children

As can be seen from the example above, there are many different
types of items that could be used for the test of geography. The merits
of these and other types of items will be discussed in the next chapter.
Before getting to that subject matter it should be understood
at this point in the book that any of these types of items and most of
those that will be discussed in the next chapter can be made into
good tests. The essential consideration is that items be carefully
thought out and carefully composed regardless of their specific forms.
Another teacher might have composed a somewhat different outline
from that above and composed different items with respect to the
outline, but that is not the important point. The important point is
that there should be a comprehensive outline and that the outline
should be skillfully translated into a broadly representative collection
of test items.

In the following section one more example will be given of an
outline for constructing a classroom test, then in the subsequent
sections of this chapter some general principles will be stated for
constructing teacher-made tests.


Outline of Objectives for Physics
A. Measurement in physics
1. Types: length, mass, volume, density, and time
2. English and metric systems
3. International standards
4. Accuracy of measurement
B. Forces in liquids
1. Relations between force and pressure
2. Influence of gravity and depth on pressure
3. Archimedes’ principle
4. Pascal’s law
5. Underwater vessels and equipment
6. City water supply
C. Forces in gases
1. Relations of gases to fluids
2. Air pressure
3. Siphons and pumps
4. Composition of atmosphere
5. Biological effects of air and air pressure
6. Weather and weather forecasting
7. Boyle’s law
8. Principles of flight
D. Laws of motion
1. Vector and scalar quantities
2. Components of force
3. Balanced and unbalanced forces
4. Relations among speed, time, and distance
5. Newton’s laws of motion
6. Gravitation
7. Laws of freely falling bodies
8. The pendulum


In addition to using these subject-matter classifications, the teacher 
decided to compose items to test three types of knowledge: (a) 
knowledge of simple facts and definitions, (b) understanding physical 
laws, and (c) applications in daily life. Examples of items which 
would fit the outline are as follows: 


For section A, subsection 2: 
The major advantage of the metric over the English system of meas- 
urement is the 
a. higher accuracy of measuring very small objects 
b. inexpensiveness 
c. ease of changing from one unit to another 
d. familiarity of the average American with the metric system 

For section B, subsection 2: 
Force on a submerged object is
a. exerted equally in all directions
b. exerted more on the bottom
c. exerted more on the top
d. exerted in proportion to the density of the object
e. exerted in proportion to the weight of the object

For section B, subsection 3: 
Why does oil float on water? 

For section C, subsection 4: 
More than three-fourths of the earth’s atmosphere is composed
of ________.

For section C, subsection 8: 
Why are rockets able to go much higher than balloons? 

For section D, subsection 2: 
An airplane is headed due north at an airspeed of 200 miles per hour. 
A wind is blowing from west to east at 30 miles per hour. Diagram 
the component and resulting forces. What are the approximate ground 
speed and direction of flight of the plane? In order to arrive at a city 
directly north of his point of take-off, how would the pilot have to 
alter his course? 


For section D, subsection 5:
Describe Newton’s three laws of motion. Illustrate each with how the
ball behaves in a game of baseball.
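For a problem item such as the airplane question under section D, subsection 2, the teacher needs the numerical answers settled before scoring begins. The sketch below works them out, treating the airspeed and the wind as perpendicular velocity components; the variable names are illustrative, and the assumption that the corrected course keeps the same 200-mile-per-hour airspeed is ours, not the item's.

```python
import math

airspeed = 200.0  # miles per hour, plane pointed due north
wind = 30.0       # miles per hour, blowing from west to east

# Plane pointed due north: the wind adds an eastward component.
ground_speed = math.hypot(airspeed, wind)
drift_deg = math.degrees(math.atan2(wind, airspeed))  # east of north

# To travel due north, the plane must point west of north far enough
# that the westward part of its airspeed cancels the wind.
correction_deg = math.degrees(math.asin(wind / airspeed))  # west of north
northward_speed = math.sqrt(airspeed**2 - wind**2)
```

Under these assumptions the ground speed is a little over 202 miles per hour on a track about 8.5 degrees east of north, and the pilot would have to point the plane about 8.6 degrees west of north to arrive at a city directly north of his take-off point.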


Special Problems 


Even if someone gave Max Marshall an excellent outline of course
content and an excellent set of test items to match, there would still
be a number of questions in his mind as to how the final test should
go. Following are some of the major issues that he might encounter.

Timing the Test. Nearly all tests have a time limit, if for no other
reason than to get the testing over so that other activities can begin.
If students are told that they can take as much time as they like,
what will happen is that half the students will complete the test
in thirty or forty minutes, most of the rest will finish within an
hour, and a few will go on interminably. Since the total time of
testing is determined by the last individual who voluntarily quits,
giving an unlimited amount of time often means that the total testing
time lasts far longer than is intended. On nearly all tests a finite, if
generous, amount of time is allotted, and all students are required to
stop at that point. Most teachers are chary of using highly speeded
tests in which speed per se rather than a deeper understanding is the
main thing being tested. Consequently, we are prone to give students
more time than they actually need on tests. Studies have shown that
unless time limits are quite restrictive, more restrictive than would
usually be the case, speed per se has relatively little influence on
results. A good working rule is to try to set a time limit such that 90
per cent of the students will feel that they have ample time. The other
10 per cent probably would feel that they were rushed even if they
were given four times the amount allotted.

In setting the time limit for the test, the teacher must compromise
two standards. In order to make the test a comprehensive coverage of
the subject matter, and thus to make it reliable, it is necessary to
include as many items as possible. On the other hand, in order to
prevent the inclusion of too many items, and thus rush students
excessively, it is necessary to put a ceiling on the number of items
that is used. Several rules may be helpful in judging what would be the
ideal number. On multiple-choice items where, say, there are five
alternatives for each item, students in the higher elementary grades
and in high school can usually complete at least forty items in a
period of fifty minutes without being excessively rushed. If instead of
multiple-choice items, essay items are used in which each answer is
restricted to no more than one-half page, most students can easily
complete six or more such items in a period of fifty minutes. Of
course, where an extensive amount of writing is required, children
below the fifth or sixth grade would need to proceed more slowly. It
usually would be the case that forty multiple-choice items or six
half-page essay items would provide reasonably comprehensive coverage
of one month’s or six weeks’ content in most subjects.
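The working figures given in this section (about forty multiple-choice items, or about six half-page essay items, in fifty minutes) can be turned into a rough planning aid. The sketch below is only that: the per-item rates are derived from those two figures, the function names are hypothetical, and the rates would need to be slowed for younger children, as noted above.

```python
from fractions import Fraction

# Minutes per item implied by the rules of thumb in the text.
MC_RATE = Fraction(50, 40)     # 1 1/4 minutes per multiple-choice item
ESSAY_RATE = Fraction(50, 6)   # 8 1/3 minutes per half-page essay item


def items_that_fit(period_minutes, rate):
    """Largest whole number of items that fits in the period at this rate."""
    return int(period_minutes // rate)


def mixed_test_minutes(n_multiple_choice, n_essay):
    """Estimated working time for a test that mixes both item types."""
    return n_multiple_choice * MC_RATE + n_essay * ESSAY_RATE


mc_in_period = items_that_fit(50, MC_RATE)
essay_in_period = items_that_fit(50, ESSAY_RATE)
mixed_minutes = mixed_test_minutes(20, 3)
```

Exact fractions are used in place of decimal arithmetic so that fifty minutes divided by 8 1/3 minutes per essay item comes out to exactly six items and is not rounded down to five.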

Physical Setting. To say that tests should be given in quiet, well- 
lit rooms where students have ample space is to offer the type of 
aphorism that will make readers yawn. However, in practice teachers 
are often the worst offenders of the rule. While students are busily 
taking tests, teachers (at least inexperienced ones) are often guilty of 
pencil tapping, foot thumping, pacing back and forth between the 
window and the door like caged lions, noisily opening and closing desk 
drawers, and indulging in stage whispers with visiting fellow teachers. 
Students are quite attuned to what teachers do and say, and nothing 
is so distracting as for the teacher to move or make a noise. In most 
situations it is best for teachers to remain at their desks and stay alert 
to the need to answer students’ questions. If it is necessary to move
about the room, this should be done as quietly as possible. 

In addition to the obvious rules that anyone would employ about 
the setting for testing, the time of day is important. Indeed one could 
make the argument that there is no good time for testing. When chil- 
dren first arrive in the morning, they are still waking up, after recess 
they are too tired, before lunch too grouchy, after lunch too sleepy,
and near the end of the school day too itchy. Seriously, it is wise to 
test during those periods when children are at their best and neither 
so hungry nor so full, so restless nor so tired as to not be able to
manifest their best performance. Although it is not always possible to stick
to such an ideal schedule, the hours between nine and eleven and one 
and two are usually the best times for testing. 

Announcing Tests. With a look of fiendish delight, Miss Gail said 
“I gave my biology class a pop quiz this morning.” Although all 
teachers can appreciate the pop quiz as one of the few remaining 
ways of hitting back now that corporal punishment is going out of 
fashion, Miss Gail was probably not doing the best thing for her stu- 
dents. The bulk of the argument is on the side of announcing tests 
well in advance and preferably even giving students a complete sched- 
ule of testing at the beginning of the term. (This rule is more ap- 
plicable to students in junior and senior high school. Students in the 
lower elementary grades would either forget or not particularly care 
when tests were to be given.) 

The announcement of tests is part of the general principle that, 
many other things being equal, instruction is most effective when stu- 
dents are as well informed as possible about the course, the teacher's 
intentions, the merits of different types of study, the relative emphases
to be placed on different aspects of subject matter, and the standards
of grading that will be applied. To not announce tests is to not
provide students with information which they need. Although the pop
quiz can, in part, be justified on the grounds that it tends to keep
students on the ball who otherwise would wait until the last minute to
study, there are better arguments for fully informing students about
the schedule of testing. Even the most conscientious student must plan
his work in terms of a schedule of testing.

Final Details. A good test consists not only of a set of items but
also of a set of rules and procedures that must be made clear to
students. The test instructions constitute a very important part of
these procedures. Students should be told in advance if special
materials are required in the testing. In this regard students should be
told whether or not it is to be an open book exam, whether scratch
paper will be needed, and whether rulers and other mechanical aids
will be required.

If no more than twenty or thirty students are to be tested in one
classroom, the preparations for testing are usually not elaborate. In
contrast, if relatively large numbers of students are to be tested
either at one time or in successive groups, then it is very important to
carefully plan the testing routine. This would be the case when, for
example, either a teacher-made test or a commercially distributed test
is being administered to all the children in the sixth grade in a
school system. One of the best ways to ensure that the testing routine
is smooth is to role play the testing. This is done by conducting a mock
testing in which fellow teachers serve as students. The teachers who
will administer the test go through all the testing routine starting with
giving instructions and ending with the collection of finished
examinations. The “students” follow the directions of the test
administrators and take notes on places where the test administration
can be improved. After the mock administration, all the participants
join in a discussion of the testing routine and make suggestions for
improvements.

The role playing of the test administration inevitably produces a
number of ideas for improving the testing routine. The following are
typical suggestions that would be obtained: (a) place tests and
answer sheets on individual desks beforehand rather than take the
time to pass out these materials after students are seated, (b) place a
clock in the front of the room so students can see how much
time has elapsed, (c) clarify the instructions regarding whether or not
students can go back through the test and answer questions that they
skipped the first time around, (d) obtain a supply of sharpened
pencils to replace the pencil points that will inevitably be broken,
(e) tell students to leave their test booklets and answer sheets at the
rear of the room as they file out rather than try to bring them to the
head of the room, (f) more fully explain to students how the special
answer sheet is to be used, and (g) inform students that they can use
the margins of pages in their test booklets as scratch paper. By role
playing the test administration, teachers will get many ideas for
improving the final testing.


A Realistic Outlook on Test Construction 


A point that will be made many times throughout this book is that 
it would be silly for us to recommend the use of overelaborate, highly 
time-consuming procedures of test construction, administration, and 
interpretation. Of course teachers have thousands of things to do in 
addition to the construction and use of tests. The major purpose of this 
book is to show teachers how they can best exert their energies in the 
construction and use of tests, but this must necessarily be done in a
relatively small portion of the teacher’s time. Consequently, it would
be foolish to recommend the development of extremely detailed and
complex outlines of subject matter or the use of very long and involved
tests.

Because of the practical difficulties involved, it is unrealistic to
expect each teacher-made test to be a paragon of test construction. For
this reason it is easy for both teachers and students to take the results
of any one test too seriously. Much more important than the result
from any one test is the cumulative record of a student in a particular
subject matter. If, for example, from the fifth through eighth grades a
student has a great deal of difficulty in mathematical topics, the
deficiency in mathematics is well documented, and appropriate actions
can be taken. To reach important conclusions about students on the
basis of only one teacher-made test would be as unwise as it would
be for the prospector to abandon his claim because the first shovelful
was not brimming with gold.

Summary 


Good tests do not arise from sudden inspiration or from last-minute
desperation, but rather they must be planned in advance. The major
aspect of planning a test is to outline the objectives of a unit of
instruction. In most commercially distributed achievement tests a great
deal of care is spent in constructing the outline of content. Although
teachers will not have the time to lavish such care in planning their
own tests, they can at least jot down the major areas of content to be
covered. Actually, the small amount of time spent in constructing an
outline will be repaid by the relative speed with which items can be
constructed once the outline is available as a guide.

In addition to promoting the representativeness of tests, the outline 
also provides the teacher with some insights about what he teaches 
and what he values with respect to the subject matter. After looking 
at their own outlines, it is often easy for teachers to see that either 
they are not emphasizing important material in their instruction or that 
they are including inappropriate items in their tests. 


Suggested Additional Readings 


Bloom, B. S. (Ed.) Taxonomy of educational objectives. Handbook I, Cognitive
domain. New York: Longmans, 1956.
French, W. Behavioral goals of general education in high school. New York:
Russell Sage, 1957.
Kearney, N. C. Elementary school objectives. New York: Russell Sage, 1953.
Ross, C. C. and Stanley, J. C. Measurement in today’s schools. (3rd ed.)
Englewood Cliffs, N.J.: Prentice-Hall, 1954, chaps. 5, 6, 7.
Thorndike, R. L. and Hagen, Elizabeth. Measurement and evaluation in
psychology and education. (2nd ed.) New York: Wiley, 1961, chap. 3.


chapter 


Test Items 


The basic unit of test construction is the item, the individual “thing” 
that is scored. Because total test scores are obtained by adding up the 
item scores, it should be obvious that the total test can be no better 
than the items of which it is compounded. Items take on very different 
appearances in different types of tests—true-false, multiple-choice, 
identification, short-answer essay, and problems. 

In spite of the obvious importance of test items, the inability to com- 
pose good items is the major reason why some teachers do a poor job 
of evaluating the progress of students. One cardinal fault of many sets 
of test items is that they are not broadly representative of the
important content in a particular unit of instruction. Either they are
overly slanted toward one or another aspect of the content, or even if
they are broadly representative, they tap only trivial information, e.g.,
memory for miscellaneous facts. How to avoid these pitfalls was
discussed in the previous chapter, where it was said that the best way to
ensure a broad representation of the important content is to start with
an outline of the concepts to be covered in a unit of instruction.

Even if the teacher has done an excellent job of outlining the unit 
of instruction, the test which he composes may be very poor. The out- 
line represents an intention to construct a good test; but without the 
patience and skill to translate it into valid items, the outline itself is 
only camouflage for poor evaluations of students. 

It takes patience to construct good tests because, like most impor- 
tant tasks, test construction is a time-consuming pursuit. Because 
teachers have many, many other things to do besides construct tests, 
it would be unrealistic to expect weeks of effort to be devoted to each 
test or to expect each to be a masterpiece of educational
measurement. However, some teachers, even some of those who otherwise do
an excellent job of teaching, devote a disproportionately small amount
of time to the construction of tests. Either they are cavalier enough
to assume that tests are unimportant or vain enough to think that a
good test can be composed in a few minutes’ time.

The skill of writing test items springs from two interrelated com- 
ponents. First, there is a great deal of technical information available 
on the effectiveness of different types of test items. Some of the most 
important rules for constructing items will be described in this
chapter. If these are not obeyed, it is very doubtful that teachers will
construct good tests.

In addition to learning and using the technical details of item
writing, another ingredient is essential: journalistic skill. Item
writing is partly an art, in the same sense that golf, bridge, painting,
and writing books are arts. No art can be learned entirely from a book,
but successful performance (writing test items) is usually built on a
foundation of technical information (rules for writing items).

The art of item writing must be acquired from classroom
experience. Most teachers learn to write better items after several years of
practice; although, in this respect, some teachers defy the time-honored
principle that practice makes perfect. Teachers who save copies of
tests can often look back to earlier years and see that many of their
items were inappropriate, ambiguous, trivial, too difficult, or too easy.

Over a period of years teachers obtain information about the
effectiveness of their tests from a number of sources. The test results
themselves offer one important type of information. If students do better
or worse than expected on the whole test, or if students do better on
some parts of the test than others, these offer hints for composing
better items. (Some simple statistical procedures to facilitate the
analysis of test results will be mentioned later in the chapter.)

The reactions of students constitute another important type of
information. Although students are not generally known to like tests,
they become pretty fair judges of the adequacy of examinations. By
listening to their complaints, and surprisingly, the praise they
sometimes give, the teacher learns to construct better tests.

Probably the most important source of information is obtained from 
one's fellow teachers. Items that we think are models of clarity may 
appear ambiguous to our friends. Whereas we may think that our tests 
are not too long, other teachers may use only half as many items for a 
one-hour test. Whereas we may think that we are employing ingenious 
items, other teachers may have even better ideas. A free exchange 
with fellow teachers is one of the most effective ways to help polish
the art of writing test items.


Essay and Objective Examinations


In the essay examination the student is given a few questions to
which he responds in some detail. A typical essay item is:


Describe the Bill of Rights, considering (a) its place in the Constitution, 
(b) the people in Congress who supported it, (c) its purpose, and (d) 
how it grew out of sections in the constitutions of various states. 


Relating to the same subject matter, an objective item would be: 


The major reason for enacting the Bill of Rights was to
a. free the slaves in the South
b. ensure the freedom of individuals
c. establish law enforcement agencies for the protection of citizens
d. ensure that all adults could vote in national elections
e. prevent illegal seizure of property


Because the essay and the objective item form the backbone of most 
teacher-made tests, it is important to discuss in detail their character- 
istics and their relative advantages. 

The “objective” examination is so-called because the scoring pro- 
cedure can be completely stated in advance of testing. On each item 
the student tries to select the correct answer (usually only one) from 
a prescribed list, or he supplies the one term, name, or date which 
answers the question. It should be realized that such examinations are 
“objective” only in the sense that the rules for scoring are absolutely 
clear. It may be, and it often happens, that the teacher is wrong and
has designated an incorrect response as being correct. Nevertheless, in 
the objective examination the teacher makes it clear what he considers 
to be correct answers, which offers the basis for forthright discussions 
with students and teachers about the test items. In contrast, on essay 
examinations teachers sometimes have such difficulty in determining 
their own standards for grading that it makes it hard to communicate 
with students and teachers, and an important type of information is 
lost which might help in constructing better examinations in the
future. 

Because it usually takes less than a minute for the average student 
to answer each item, it is possible to include many items in the objec- 
tive test (more than 40 in 50 minutes’ time without “pushing” stu- 
dents). In contrast, on most essay tests many fewer items are usually 
employed. The advantage of having more items is that it permits a 
much wider sampling of the content. (For this reason, later it will be 
recommended that the teacher use many short-answer essay questions 
rather than only three or four long-answer questions.)

Previously it was said that one way to increase test reliability is to
increase the number of items. For the sake of reliability, the more
items the better. This is because a test is only a sample of what the
student knows, and unless it is an extensive sample, the test may do a
very poor job of assessing knowledge. On essay tests containing only
a few questions, a student often has the feeling that he
was lucky because the teacher happened to choose questions with which
he was familiar. Another student with the same level of ability might
be less lucky. One of the purposes of a good test is to take
the luck out of evaluation, and this cannot be done unless
the test extensively samples the important content. From the
standpoint of content sampling, objective tests and essay tests with many
short-answer questions have decided advantages over essay tests that
contain only several long-answer questions.
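One conventional way to put a number on the gain from adding items is the Spearman-Brown formula, which this passage does not name and which is brought in here only as an illustration. It assumes that the added items are of the same kind and quality as those already in the test; the function name and the starting reliability of .60 are hypothetical.

```python
def lengthened_reliability(reliability, length_factor):
    """Spearman-Brown estimate of reliability after changing test length.

    reliability   -- reliability of the present test (between 0 and 1)
    length_factor -- new length divided by present length (2 = twice as long)
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# A hypothetical short test with a reliability of .60, then doubled and tripled.
doubled = lengthened_reliability(0.60, 2)
tripled = lengthened_reliability(0.60, 3)
```

The estimate keeps rising as items are added, but each addition buys less than the one before, which agrees with the practical advice to use many items without making the test too long for the time available.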
Objective examinations are often said to (a) get at only the memory
for simple facts and trivial details, (b) provide no opportunity to see
how well students organize their thoughts, and (c) measure none of
the student’s critical and creative abilities. These are legitimate
criticisms of the ways in which objective tests are often constructed,
but they are not inherent faults of objective tests. Much depends on the
skill of the test constructor. The truly skillful item writer can test
almost anything with objective items. Examples of how this can be done
will be shown later. The reason why so many teacher-made objective
tests fail to get at more important parts of the content is that the
teacher does not have the skill and/or the time to compose an
excellent test.

Do essay examinations provide an opportunity for students to show
how well they can organize their thoughts? If well constructed, they
can, but often this potential advantage of essay examinations is not
realized. In long-answer questions, students are often not provided
with guides as to how they should answer. For example, on the
question, “What happened to art during the Renaissance?” so many
different lines of attack could be made that even the brilliant student
is apt not to produce the type of answer which the teacher wants.
Students often must become mind readers to figure out what the teacher
wants. Consequently, they write a hodgepodge of things to ensure that
they have covered the topic.

Long-answer essay items give a distinct advantage to the student
who writes well, even if he does not know a great deal about the
subject matter. If he writes in a clear hand, uses fancy words, and
weaves them into elegant sentences, it is hard for the teacher not to be
influenced when assigning grades. Another student who has an
illegible hand, misspells words, and makes grammatical errors may
actually know more about the topic; but even if the topic is biology
or physics, teachers are loath to give such students good marks on
tests.


Do essay questions really get at the student's ability to reason with
subject matter? They can, but in practice they often do not. When we
ask students to explain, analyze, and criticize, what we too often mean
is to recite the facts stated in the text or lectures. For example, when
we ask students in an essay question to “explain” the events leading
up to the First World War, what we often mean is to recite the se-
quence of events as it is presented in the text, e.g., the Sarajevo
assassination.

The objective test is often criticized because it provides students 
with no opportunity to learn to express themselves in writing, which 
is, of course, correct. Does the usual essay examination help students 
to learn to express themselves in writing? It can if examination ques- 
tions properly structure the types of answers that should be given and 
if students are given ample time to respond. However, students often 
are so hurried on essay examinations, and so mindful of the need to 
write a lot, that they seldom turn out elegant prose. Because of this, 
some have suggested that the prevalence of essay examinations has 
contributed to sloppy writing. Of course, it is very important that stu-
dents learn to write well, but the essay examination is probably not
the place to learn. Much better is to assign project reports, themes, 
and term papers. In these the student has the time to carefully com- 
pose his ideas, to write well, and to manifest his real critical abilities. 
Such out-of-class papers constitute a very important addition to class- 
room tests. 

Do you usually get very different results from essay and objective 
tests? Suppose, for example, you give your students an essay and an 
objective test, both of which are intended to cover the same material. 
Would they correlate highly? They usually do. Within the limits that 
measurement error permits, two such tests usually correlate quite 
highly. Consequently, even if intuition suggests that the two types of 
tests should produce markedly different results, the research evidence 
makes it quite obvious that the two usually are not very different in 
practice.
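The claim that two such tests "correlate quite highly" refers to the ordinary product-moment correlation between the two sets of scores. A minimal sketch of that computation follows. The function name `pearson_r` and the eight pairs of scores are hypothetical, invented only to illustrate the arithmetic; they are not data from any study.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores for the same eight students on the two tests.
essay_scores = [72, 85, 60, 90, 78, 66, 95, 81]
objective_scores = [70, 88, 58, 85, 80, 70, 92, 77]

r = pearson_r(essay_scores, objective_scores)
print(f"correlation between essay and objective scores: {r:.2f}")
```

A teacher who keeps both sets of marks for a class could substitute real scores for the invented lists to see how closely the two kinds of test agree.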

The worst potential fault of the essay examination, particularly the 
long-answer kind, is unreliability of scoring. If two teachers grade the 
same essay questions, they might disagree markedly about some of the 
responses. One teacher would give a paper C, the other would give 
an A. One teacher would give failing grades to half the class, the other 
would give a high proportion of A’s and B’s. Even if a teacher regrades 
a set of questions after a month’s time (and there are no names on the 
papers), he will disagree somewhat with his prior gradings of some of 
the papers. Fortunately, there usually is enough consistency in the 
grading of essay examinations so that the accumulated average of a 
student over a semester, and particularly over the years, comes to rep- 
resent a fairly reliable evaluation. But we should not kid ourselves 
about the results of only one essay test as it is typically constructed, 
administered, and scored: the results often are not highly reliable. 




To prevent the dire warnings above from leaving readers with the 
feeling that all examinations are no good (objective tests completely 
trivial and essay tests completely unreliable), let us summarize a few 
important points. Above were discussed the faults that often occur in 
classroom tests. Tests can be made much better than they usually are. 
By the ingenious construction of items, objective tests can measure 
almost anything, including the ability to “reason with.” Essay tests can 
also be oriented toward “reasoning with,” and, by following some of 
the rules stated later in Chapter 7, they can be made highly reliable. 
Both objective and essay tests, and many different types of each, can 
be excellent methods of evaluation if they are carefully constructed. 

The decision whether to use objective or essay exams, and which 
particular type, often depends on practical considerations: the number
of students to be tested, the particular subject matter, and the prefer- 
ences and skills of the teacher. The practical advantage of the objec- 
tive test over the essay test grows with the number of students to be 
tested. It takes much longer to construct a good objective examination 
than an essay examination, but it takes much less time to score the 
objective examination. If there are only ten students in the class, it usu- 
ally takes less total time to construct and score an essay test. When 
there are as many as twenty students in the class, it is usually
time-saving to employ an objective test. If the number of students is as
large as forty, the time-saving is immense. (Procedures for scoring
objective tests will be discussed in Chapter 7.)
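The trade-off described above (more time to construct an objective test, less time to score it) can be made concrete with a small calculation. The sketch below is only illustrative: the minutes assigned to constructing and scoring each kind of test are assumptions chosen so that the results agree with the class sizes mentioned in the text, and the names `ESSAY`, `OBJECTIVE`, `total_minutes`, and `cheaper_format` are hypothetical.

```python
# Assumed time costs in minutes (illustrative figures, not measurements).
ESSAY = {"construct": 30, "score_per_student": 15}
OBJECTIVE = {"construct": 240, "score_per_student": 1}

def total_minutes(cost, n_students):
    """Time to construct the test once plus time to score every paper."""
    return cost["construct"] + cost["score_per_student"] * n_students

def cheaper_format(n_students):
    """Return which kind of test takes less total teacher time."""
    essay = total_minutes(ESSAY, n_students)
    objective = total_minutes(OBJECTIVE, n_students)
    return "essay" if essay < objective else "objective"

for n in (10, 20, 40):
    saved = total_minutes(ESSAY, n) - total_minutes(OBJECTIVE, n)
    print(n, "students:", cheaper_format(n), "is quicker;",
          "objective test saves", saved, "minutes")
```

Under these assumed figures the essay test is quicker for ten students, the objective test is quicker for twenty, and the saving grows steadily as the class gets larger. Different assumed times would move the break-even point but not the general pattern.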

Some subject matters are easier to test with objective exams, and 
others are easier to test with essay exams. For example, topics such as
geography, history, and physics are easily cast in the objective form.
Others such as English composition and foreign languages are more 
easily placed in the essay form. 

Although it was said above that almost anything can be tested with
objective items, this does not mean that inexperienced teachers neces-
sarily can do that. Test-construction experts, such as those who con-
struct commercially distributed achievement tests, are usually quite
gifted at turning important content into good objective exams. It usu-
ally requires considerably more skill to compose good objective tests
than good essay tests. Consequently, until a teacher gains experience
in constructing objective items and/or if he feels that he is not suf-
ficiently skilled to compose good objective tests, he is quite right in
relying mainly on essay forms. It is usually wise for the teacher to fol-
low his own judgment and employ the types of items with which he
feels most comfortable. However, it is also wise of a teacher gradually
to experiment with different types of items in order to broaden his
repertoire of test-construction skills.

Better than to exclusively use objective or essay items is to combine
the two, either in the same test or in different tests. One procedure
is to reserve half of the testing time for objective items and the re- 
mainder for short-answer essay questions. When the results from such 
tests are combined with the results of out-of-class themes, reports, and 
term papers, the total material often provides an excellent basis for 
evaluating the progress of students. 


Types of Objective Items 


In a sense there are as many kinds of objective items as there are 
flowers in the fields, but fortunately, they can be classified into a few 
basic types. The most prominent types are as follows: 

True-False. Perhaps the most familiar type of objective item is the 
true-false, in which the student is presented with a statement to be 
marked as either true or false. A simple example is:


T F Oil is one of the principal exports of Venezuela. 


The popularity of the true-false item probably is due to the ease with 
which such items can be composed. It is usually easy to make up many 
such items in a relatively short period of time. However, the true-false 
item has several serious faults, and it is not recommended for general 
use. 

One serious fault of the true-false item is that it is greatly influenced 
by guessing. Even if a student knew nothing about the subject matter, 
he could get approximately half of the items correct by flipping a 
coin. What this means in practice is that about half of the items are 
wasted. If the true-false test contains sixty items, it is unlikely that
many students will score less than 30, because they could get thirty 
correct purely by guessing. This means that the score range is severely 
limited. The same range of scores could probably be obtained with 
forty multiple-choice items having four alternatives each. 
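The comparison between sixty true-false items and forty four-alternative items can be checked with a few lines of arithmetic. The sketch below assumes that a student who knows nothing guesses blindly on every item, so that his expected score is the number of items divided by the number of alternatives. The function names are hypothetical.

```python
def chance_score(n_items, n_alternatives):
    """Expected number right for a student who guesses blindly on every item."""
    return n_items / n_alternatives

def effective_range(n_items, n_alternatives):
    """Width of the score range that lies above the chance score."""
    return n_items - chance_score(n_items, n_alternatives)

# Sixty true-false items (two alternatives each).
true_false = effective_range(60, 2)
# Forty multiple-choice items with four alternatives each.
multiple_choice = effective_range(40, 4)

print("true-false: chance score", chance_score(60, 2),
      "usable range", true_false)
print("multiple-choice: chance score", chance_score(40, 4),
      "usable range", multiple_choice)
```

Both tests leave the same number of score points above the chance level, which is the sense in which about half of the true-false items are wasted.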

In addition to the restriction of range, the large amount of guessing 
that can occur introduces a considerable amount of unreliability. 
Teachers should be warned that they cannot get rid of the measure- 
ment error due to guessing by applying the numerous correction 
formulas which have been proposed, e.g., number right minus number 
wrong. Not only are these formulas of doubtful accuracy but they do 
nothing to remove the measurement error which is introduced. The
only way to lower the measurement error due to guessing is to make 
the true-false test very long—more than sixty items. 
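A short sketch may help to show why the correction formula leaves the error of measurement in place. It assumes a student who guesses at random on every item of a true-false test, with each guess independent of the others; the number right then follows a binomial distribution. The function names are hypothetical.

```python
from math import sqrt

def corrected_score(num_right, num_wrong):
    """The familiar correction for true-false tests: rights minus wrongs."""
    return num_right - num_wrong

def guess_sd_of_corrected_score(n_items):
    """Standard deviation of the corrected score for a pure guesser.

    With R rights out of n items, the corrected score is R - (n - R) = 2R - n.
    R is binomial with p = 0.5, so its standard deviation is sqrt(n * 0.25),
    and doubling R doubles the standard deviation.
    """
    sd_right = sqrt(n_items * 0.5 * 0.5)
    return 2 * sd_right

for n in (60, 240):
    sd = guess_sd_of_corrected_score(n)
    print(f"{n} items: corrected scores of pure guessers average 0 "
          f"but scatter with SD {sd:.1f} ({sd / n:.3f} of test length)")
```

The correction brings the average score of a guesser to zero, but individual guessers still scatter above and below zero. Only lengthening the test shrinks that scatter in proportion to the length of the test.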

Another serious fault of true-false items is that it is usually very dif- 
ficult to make statements that are either absolutely true or false. In 
the example above it is absolutely true that oil is the major export of
Venezuela, but let us look at two items in which the absolute truth is
in doubt. 


T F The major problem of nations in Africa is to develop more industry. 


Although it is agreed that creating new industry is one important prob- 
lem, the discerning student may rightly contend that it is not the most 
important one. He might argue that more essential is to develop politi- 
cal stability. 


T F Regardless of how loud they are, all sounds travel at the same 
speed. 


In one sense the statement is true: loudness does not influence the 
speed of sound. For that reason the knowledgeable student might be 
inclined to mark “true.” But to do so would ignore the fact that the 
speed of sound is influenced by temperature and the medium in which 
it travels. If he takes this aspect of the statement into account, he 
would mark “false.” To respond correctly, the student has to guess 
what the teacher has in mind. 

There are few subject matters in which absolutely true or false
statements can be made. Even then it is often necessary for teachers
to supply considerable detail in the statements to ensure that students
are not misled. Consequently, a good true-false test is often more time
consuming to construct than a multiple-choice form.

An item type similar to the true-false form is one that presents two
statements from which the student is to pick the one that is more
correct:


____ The speed of sound is directly proportional to the loudness.

____ The speed of sound is not influenced by loudness.


This is a much better item than the previous one. Here the student 
can clearly see that the central issue is the effect of loudness. Although 
this type of item avoids some of the ambiguity of the true-false form,
it does not reduce the effect of guessing.

A third major fault of true-false items is that they introduce un-
wanted test-taking habits. Some students typically mark many answers
“true” no matter what test they are taking and regardless of how
much they know. Other students have a habit of marking “false.”
These general tendencies are compounded with guesses about what
proportion of true and false answers the teacher typically uses. Stu-
dents usually assume that most teachers have a near balance of true
and false answers, which is not a bad assumption. Students often learn
that a particular teacher usually employs many more true or many
more false items. These test-taking habits, assumptions about usual 
balance of true and false items, and guesses about the proclivities of 
particular teachers can only work to lower the effectiveness of the 
tests. 

Fill-in. In the fill-in item the student supplies the one term, name, 
or date which either completes a statement or answers a question. 
Two examples are: 


The Constitution requires that a member of the United States House of 
Representatives be at least ____ years old.


How long must an individual be a citizen before he is eligible to be- 
come a member of the United States House of Representatives? 


The fill-in type of item is actually midway between objective and essay 
forms. One might say that the fill-in item is actually a short, short 
essay item, because, rather than choose among given alternatives, the 
student must supply the correct answer. The fill-in item is included
among the objective types because, as it is treated here, only one term 
must be supplied, which can be stated in advance of testing. 

The fill-in type of item is most useful when there are many simple 
names, dates, and facts to be learned. If the part to be filled in re- 
quires several terms and/or sentences, the fill-in item is usually in- 
appropriate. Relating to this principle, the author once encountered 
the quintessence of ambiguity in a fill-in item on a test for a driver's 
license: 


Most automobile accidents are caused by ____, ____,
and ____.


In a puckish mood, the author filled in men, women, and children. 
Although these were not at all the answers which the examiners had 
in mind, they had to admit that the answers pretty well covered the 
field. 

Even when there is only one term to be filled in and the teacher 
has in mind an unambiguously correct response, the phrasing of the 
statement might confuse the student. Several principles will be given 
for unambiguously phrasing fill-in items: 

1. Use only one or, at the most, two blank spaces. The three-blank 
item above was given to show how ambiguous fill-in items often are 
when more than one term is to be filled in. 


2. Make sure that only one term will sensibly complete the state- 
ment or answer the question. A poor example is: 


Venezuela is located ____.


In addition to the intended answer of “in South America,” equally
reasonable would be “far from us,” “north of Brazil,” and (for the fun
of it) “with difficulty.” Better would be:


Venezuela is located on the continent of ____.


Another example of a fill-in item in which numerous responses would
be correct is:


Oxygen is essential for ____.


Equally good responses would be “combustion,” “breathing,” and “sub- 
marine crews.” Better would be: 


Which one of the gases in the atmosphere is essential for combustion? 


The two examples above are rather obvious instances in which the
intended response is ambiguous. In more subtle ways, teachers often
mislead students in regard to which answer should be inserted. The
usual fault is that teachers fail to add sufficient detail to pinpoint the
correct answer. After the test is administered, teachers often learn
from the responses of students that more than one answer could prob-
ably be inserted, and, in this way, they gradually learn how to write
less ambiguous fill-in items.

3. Leave only important terms blank. A poor example is:


In 1492 Columbus ____ America.


Obviously better would be:


Columbus discovered America in the year ____.


4. Place the blank space near the end of the sentence. A poor ex-
ample is:


The ____ is authorized by the United States Constitution to
try all cases of impeachment.


Better would be:


The United States Constitution states that all cases of impeachment will
be tried by the ____.


The fill-in item has only a modest place in most test construction
problems. To summarize the major drawbacks: (a) not many subject
matters concern simple facts of a kind that are usually required; (b)
to rely solely on fill-in items would make it very difficult to test for
students’ ability to “reason with”; and (c) even when such items are
appropriate, it is difficult to phrase items with sufficient clarity to keep 
from confusing students. 

Matching. In the matching item students are presented with two 
lists of names, facts, or principles. A simple example is as follows: 


Pizarro ____     a. visited China
Balboa ____      b. explored the Hudson River
Magellan ____    c. discovered Japan
Cortez ____      d. discovered the Pacific Ocean
                 e. sailed around the world
                 f. explored Mexico
                 g. conquered Peru


In the blank spaces students are asked to write the letter correspond- 
ing to the correct activity on the right. The major advantage of the 
matching item is that it can condense a considerable amount of mate- 
rial in a short space. The matching-type item has definite advantages 
if some simple rules are followed: 

1. One of the two lists of “things” should be at least 50 per cent
longer than the other. If the two lists are of the same length, guessing 
plays a prominent part in matching items. If, in the example above, 
there were only four activities on the right to go with the four names 
on the left, and a student knew three correct matchings, he would get 
the fourth one “free.” If there were only five activities on the right, he 
would have a fifty-fifty chance of correctly identifying the fourth ex- 
plorer even if he flipped a coin. Consequently, one list should be con- 
siderably longer (at least 50 per cent longer was recommended) to 
lower the influence of guessing. It does not particularly matter whether 
the longer list is the one on the right or the one on the left. In the ex- 
ample above, the influence of guessing could have been lowered by 
listing seven explorers to be matched with only four adventures. 
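The effect of list length on guessing can be worked out directly. The sketch below assumes that the student knows each entry on the longer list is used at most once, that he has already matched some names correctly, and that he guesses at random among the entries that remain. The function name `chance_of_last_match` is hypothetical.

```python
from fractions import Fraction

def chance_of_last_match(n_options, n_known):
    """Chance of guessing one remaining match correctly.

    n_options: number of entries on the list being chosen from.
    n_known:   number of matches the student already knows for certain,
               which removes that many entries from consideration.
    """
    remaining = n_options - n_known
    if remaining < 1:
        raise ValueError("no entries left to choose from")
    return Fraction(1, remaining)

# A student knows three of the four explorers and guesses the fourth.
for n_options in (4, 5, 7):
    print(n_options, "activities listed: chance of guessing the fourth is",
          chance_of_last_match(n_options, 3))
```

With four activities the last match is certain, with five it is an even chance, and with seven it falls to one in four, which is why the longer list lowers the influence of guessing.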

2. The shorter of the two lists should contain no more than about 
six entries. When lists are longer than the maximum recommended, 
students get lost in scanning the two lists and might make incorrect 
matchings purely because of clerical errors. Also, as the length of the
lists is increased beyond the maximum recommended, the time to 
complete the matching goes up markedly. 

3. All the entries in each list should relate to the same central theme.
In the example above it would have been poor practice to put “George
Washington” in the list of names because he clearly would belong in a
different era and relate to different historical events. For the same
reason, it would have been wrong to include in the list of activities
“signed the Declaration of Independence” or “invented the micro-
scope.” If such implausible entries are placed in either list, students
may be able to correctly match, or rule out an entry as relating, purely 
because of the striking dissimilarity with the other entries. 

4. The test instructions should clearly state how matching is to be 
performed. Students should be told if they are to locate (a) some- 
thing an individual did, (b) facts that relate to principles, (c) defini- 
tions for terms, or (d) whatever else is at issue. In particular, students 
should be told if more than one match is to be made for each entry 
and/or if an entry on one list can be matched with more than one 
entry in the other list. Generally it is better to avoid both of these 
practices: Require that only one match be made for each entry in the 
first list and that no entry in the second list be matched with more 
than one entry in the first. If double matching as in either of the two 
examples above is permitted, the student is placed in a quandary simi- 
lar to that he faces in true-false tests. He must not only decide which 
entry matches best, in a relative sense, he must make decisions about 
absolute matching. This would not be a great drawback with material
as simple as that used in the example above. However, when the en-
tries concern terms and definitions, facts and related principles, and
other such complex material, it is unwise to assume that entries can
be stated in sufficient detail and clarity to make absolute matching
possible.

Customarily the matching item is used when there are many simple
facts, dates, names, and definitions to be remembered. However, the
matching item can serve to test more important and more complex
mental processes. An example is:


Event                                   Principle
1. Rocket flight                        a. Archimedes’ principle
2. Book on shelf                        b. Friction
3. Weight of object under water         c. Siphoning
4. Warming hands by rubbing together    d. Potential energy
                                        e. Magnetism
                                        f. Equal and opposite reaction


In order to correctly “match,” students must understand the principles
and see their relevance for events in daily life. Many more examples
could be given in which carefully composed matching items can meas-
ure the ability to “reason with.”

Multiple Choice. By far the most popular type of objective item is
the multiple-choice, in which the student is required to choose one
alternative response to a problem or question. A simple example is:


Balboa’s major discovery was 

a. the coast of North America 

b. the Pacific Ocean 

c. a new route from Europe to the Far East 
d. the Inca civilization 


Multiple-choice items either begin with an incomplete sentence, as in 
the example above, or with a question such as “What was Balboa's 
major discovery?” Both of these will be referred to as “the problem.”
It does not greatly matter whether the problem is stated in the form 
of an incomplete sentence or a question. Most experienced item writers 
prefer the incomplete sentence because (a) it can save space both in 
the problem and in the alternatives, and (b) if well constructed it 
permits a smooth and rapid transition from reading the problem to 
seeking the correct alternative. The potential disadvantage of the 
incomplete statement is that, if the teacher is not careful, some of the 
alternatives may be phrased in such a way that they do not follow 
grammatically or would make awkward sentences. Examples of these
will be given later. Also, it usually is found that students below the 
sixth-grade level more easily comprehend items expressed as questions 
rather than as incomplete sentences. The versatility of the multiple- 
choice item permits the testing of many aspects of learning, some of 
the most prominent of which are as follows: (To save space, only the 
correct response will be given for each item.) 
1. Definitions:
   The soft coal mined in Alabama and adjoining states is called
      bituminous
2. Facts:
   At sea level in an open container, the boiling point of water is
      212°F
3. Cause:
   One important cause of feeblemindedness is
      heredity
4. Association:
   Tornadoes most frequently occur when
      warm and cold air masses collide
5. Evaluation:
   In hilly country one of the most effective ways to prevent erosion is
      contour plowing
6. Purpose:
   The purpose of the Bill of Rights is to
      guarantee the freedom of individuals
7. Application of principle:
   The law that “for every action there is an equal and opposite
   reaction” explains why
      rocket motors can supply thrust above the atmosphere
8. Implication:
   The switch in consumer spending from “goods” to services hinders
      the continued growth of industry


These are only some of the many kinds of learning that can be tested
with multiple-choice items. Not only are they very versatile, but
multiple-choice items are relatively free of some of the ills that beset
other types of objective items. Unlike true-false items, multiple-choice
items do not require that one alternative be absolutely correct. With
multiple-choice items the requirement is that one of the alternatives be
markedly better than the others. Multiple-choice items are relatively
free of test-taking habits. Because students must choose one response
for each item, there is no opportunity for students to manifest their
proclivities for saying “true” or saying “false” in general. Also, multiple-
choice tests are not influenced by students’ hunches as to how many
“true” responses are on the test. Because there usually are four or five
alternatives on multiple-choice items, they are not influenced by guess-
ing to the same extent as true-false items.

Multiple-choice items are not subject to the same amount of am- 
biguity as are fill-in items. Also, because multiple-choice items rest on 
the principle of selecting the best alternative rather than supplying one 
which is absolutely correct, the problem need not be specified as
elaborately as is often required with fill-in items. 

Although, as was said previously, the matching item can be used for 
many purposes, it usually is easier to test more complex aspects of 
learning with multiple-choice items. If the things to be matched are 
long and involved, the matching item becomes tedious to read and 
students often get lost in the maze of comparisons. This difficulty is
lessened by using the multiple-choice form. 

Unless teachers feel much more comfortable with other types of
objective items and/or they strongly feel that the particular subject
matter requires other types of objective items, it is strongly recom-
mended that the multiple-choice item be employed for most objective
tests. Because of their wide use and importance, the following section
will go into considerable detail on the rules for writing good multiple-
choice items.


Rules for Writing Multiple-choice Items 

Based on the experience that test constructors have had and on the
research that has been performed, a list of rules has been formulated
for the construction of multiple-choice items. The most important rules
will be discussed in this section.

1. The problem should clearly point to the theme of the correct
alternative answer. The multiple-choice item should not merely
present a collection of unrelated facts or ideas, one of which is true
and the others false. Instead, a clear question should be posed by the
problem which can be reasonably answered by one, and only one, of
the alternative responses.


VIOLATION: 


In southern Africa 

a. mining of diamonds is one of the most important industries 

b. tropical rain forests abound 

c. camel caravans carry trade goods 

d. economic development is hindered by a lack of modern transportation 


When the problem begins with such a vague statement as “In southern 
Africa,” many different responses could follow. The problem might as 
well have been stated “Which one of the following things is true of 
southern Africa?” Unless the one “true” and the remaining “false” 
alternatives are stated very precisely, which often requires them to be 
very long, then the intended correct alternative does not logically 
follow from the problem. In the example above, the teacher intended 
alternative a (mining diamonds) to be correct, but because of the 
failure to establish the theme of the intended response, students could 
also make a good case for alternative d. 


IMPROVED: 


Southern Africa leads the world in the mining of 
a. bauxite 

b. diamonds 

c. iron ore 

d. coal 


Here the problem clearly focuses on mining products; the qualification 
“leads the world” further specifies the correct alternative; and the 
response “diamonds” is definitely better than the others. 

2. Incorrect alternatives should be plausibly related to the problem. 
In some items the incorrect alternatives are so completely unrelated to 
the problem that, even if the student knows very little about the prob- 
lem, he can rule out all but the correct alternative.


VIOLATION: 


The vessel which carries oxygenated blood from the heart to the body 
is called the 

a. trapezius muscle 

b. fore brain 




c. patella tendon 
d. ascending aorta 


Even if a student knows little about the circulatory system, he probably 
knows that muscles, tendons, and parts of the brain are not blood 
vessels. Consequently, he rules out all but alternative d.


IMPROVED: 


The vessel which carries oxygenated blood from the heart to the body 
is called the 

a. vena cava 

b. pulmonary artery 

c. femoral artery 

d. ascending aorta 


In the improved version, unless the student knows the major veins and
arteries, he cannot rule out the incorrect alternatives.

The rule in composing alternative answers is that, to the student who
knows the answer, only one alternative is plausible; to the student who
does not know the answer, all the alternatives look equally plausible.
To achieve this ideal state, teachers must keep a fine balance between
(a) making incorrect alternatives so obviously incorrect and/or un-
related to the problem that nearly all students will mark the correct
alternative, and (b) ensuring that one alternative is actually more
correct than the others.

3. Correct alternatives should not be consistently different in ap-
pearance from incorrect alternatives. Only knowledge of subject
matter should provide clues about the correct alternative.


VIOLATION: 


a. 424°F 

b. 282°F 

c. 212°F at sea level, in an open container 

d. 98°F 

It was not necessary to include the problem in order to point the reader
to the correct answer. Alternative c is so much longer and more de-
tailed than the others that it is a dead give-away.


IMPROVED:


The boiling point of water at sea level, in an open container is

a. 424°F

b. 282°F

c. 212°F

d. 98°F


The violation above was a rather obvious example of making the 
correct alternative look different from the others. In a less obvious way 
teachers often make the correct alternative look different. This is often 
done by (a) making correct alternatives consistently longer than in- 
correct alternatives, and (b) using extra qualifications and more de- 
tailed specifications for the correct alternative. The best way to avoid 
these two faults is to include all necessary qualifications and specifica- 
tions in the problem, for instance, in the example above to include in 
the problem “at sea level, in an open container.” 

In some items it is difficult to prevent the correct alternative from 
being longer and more highly specified than the incorrect alternatives. 
This will not necessarily provide unwanted clues to students if the 
correct alternative is not consistently different in appearance. To 
balance out such clues to students, in some items some of the incorrect 
alternatives should be longer and more highly specified. 

If the teacher has time to perform a little experiment, he can deter- 
mine to what extent the appearance of alternatives is providing clues 
to students. This is done by presenting students with the sets of alterna- 
tive responses but deleting the problems, as was done in the example 
above. If students can guess the correct responses without even seeing 
the problems, the teacher is not making correct and incorrect alterna- 
tives uniform in appearance.
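One way to judge the result of such an experiment is to compare the number of correct guesses with the number expected by chance. The sketch below is an assumption about how a teacher might summarize the results, not a procedure given in the text: it treats each blind guess as independent, with a chance of success equal to one divided by the number of alternatives, and computes the exact binomial probability of doing at least as well as the student did. The function name and the example figures are hypothetical.

```python
from math import comb

def prob_at_least(hits, n_items, p):
    """Probability of `hits` or more correct answers out of n_items
    when each answer is right with probability p (binomial tail)."""
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(hits, n_items + 1)
    )

# Hypothetical result: with the problems deleted, a student picks the
# keyed alternative on 12 of 20 four-alternative items.
n_items, n_alternatives, hits = 20, 4, 12
chance = 1 / n_alternatives

print("expected by chance:", n_items * chance, "correct")
print("observed:", hits, "correct")
print("probability of doing this well by luck alone:",
      round(prob_at_least(hits, n_items, chance), 4))
```

A very small probability suggests that the appearance of the alternatives, rather than luck, is pointing students to the keyed answers.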

4. Alternatives should be randomly ordered for each item. Teachers
often unwittingly place the correct alternative more frequently in the 
middle of the list rather than in either the first or last positions. Or, if 
on one item the correct alternative is in the a position, teachers are 
prone to place the correct alternative for the next item in the last 
position. 

By far the best method of ordering alternatives is to do it randomly. 
One way to do this is as follows. Typically, in composing alternatives, 
teachers will write down the correct alternative first, then write a num- 
ber of incorrect alternatives. One way to rearrange the alternatives in 
a random sequence is by the use of shuffled cards. If, for example, there 
are five alternative answers for each item, the letters a, b, c, d, and e 
are written respectively on five cards. These are then shuffled four or 
five times and dealt out. The letter on the first card determines the 
position of the first alternative as it appeared when the item was con- 
structed, which, as was said previously, is usually the correct alterna- 
tive. The next card dealt determines the position of the first incorrect 
alternative, and so on until the positions of all five alternatives are
determined. When such a random procedure is used, students cannot
accurately detect patterns in the ordering of alternatives. 
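
The shuffled-card procedure can be imitated with a pseudorandom shuffle. The following is only a minimal sketch: the function name `shuffle_alternatives` and the sample item are illustrative assumptions, and it presumes, as the text does, that the correct alternative was written first when the item was composed.

```python
import random

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def shuffle_alternatives(alternatives, rng=None):
    """Randomly reposition alternatives written with the correct one first.

    Returns the reordered list and the letter that now marks the correct
    alternative (the one originally at index 0).
    """
    rng = rng or random.Random()
    # One "card" per position; shuffling them mirrors shuffling lettered cards.
    cards = list(range(len(alternatives)))
    rng.shuffle(cards)
    reordered = [None] * len(alternatives)
    # The i-th card dealt gives the new position of the i-th alternative
    # in its original order of composition.
    for original_index, new_position in enumerate(cards):
        reordered[new_position] = alternatives[original_index]
    return reordered, LETTERS[cards[0]]


if __name__ == "__main__":
    # Hypothetical item: the correct alternative is listed first.
    item = ["occupying the Rhineland", "invading Poland",
            "occupying Bavaria", "invading Austria"]
    ordered, key = shuffle_alternatives(item)
    for letter, text in zip(LETTERS, ordered):
        print(f"{letter}. {text}")
    print("key:", key)
```

Returning the key letter along with the reordered list saves the teacher from hunting for the correct alternative again when the scoring key is prepared.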

5. Avoid irrelevant sources of difficulty in the statement of the 
problem or in the alternatives. Items are supposed to be made dif- 
ficult because of the particular subject matter and not because of 
irrelevant sources of difficulty. 


VIOLATION: 

Hitler’s first major transgression of the Versailles concordat was

a. invading Poland
b. occupying Bavaria
c. occupying the Rhineland
d. invading Austria

In the example above irrelevant difficulty is introduced by the “fancy”
words “transgression” and “concordat.” Even a student who knew a 
great deal about the subject matter might mark the item incorrectly 
purely because of the unnecessarily difficult words. 


IMPROVED: 


Hitler first disobeyed the Versailles treaty when he ordered German
troops to


In some cases teachers want to test the ability to use technical terms, 
such as those to be found in biology, chemistry, and physics. However, 
if difficult words are employed, they should be ones that are directly 
appropriate to the subject matter. If, for example, in a calculus prob- 
lem, several French words were used, a student would have to under- 
stand the French words before he could possibly solve the problem, no 
matter how well he understood calculus. Usually it is wise to keep the 
terminology as simple as possible. When it is desired to test for word
knowledge in general, or knowledge of the technical concepts of a
particular field, specific parts of the test should be reserved for that
purpose.

6. Avoid including material in the problem which is unrelated to the 
theme of the intended response. The problem should be stated in 
sufficient detail to orient the student to the desired response. However,
if superfluous details are inserted in the problem, they will serve to
confuse students.


VIOLATION: 


John, who is a handsome boy and makes good grades in high school,
has an IQ of 105. His intelligence would be classified as


a. superior
b. genius
c. average 
d. below average 


The clause “who is a handsome boy and makes good grades in high
school” is completely superfluous and misleading. If the teacher wanted 
to test for an understanding of how intelligence tests are used to classify 
students, the problem would be better phrased as follows: 


IMPROVED: 


John has an IQ of 105. His intelligence would be classified as 


When stated in that way, alternative c (average) is clearly correct. The 
danger of including superfluous material is that it may mislead the 
knowledgeable student into marking one of the incorrect responses. In 
the example above, the phrase “makes good grades” might set the 
knowledgeable student to wondering if the IQ were a good index of
John’s true ability and cause him to mark alternative a (superior).
In general it is wise to include only those details that are necessary to
“aim” the student toward the intended response.

7. Do not employ alternatives which say “none of the above,” “both 
a and c above,” “all of the above,” etc. 


VIOLATION: 


The purpose of the Bill of Rights was to

a. free the slaves
b. give everyone the right to vote
c. ensure the freedom of individuals
d. none of the above


If “none of the above” were not included as an alternative, alternative 
c clearly would be the best answer. With “none of the above” included 
as an alternative, knowledgeable students are placed in a dilemma: 
they must decide whether or not “ensure the freedom of individuals”
is absolutely correct. The clever student might think of other reasons
why the Bill of Rights was enacted, e.g., to prevent the central govern- 
ment from becoming too powerful. Consequently, a student might mark 
“none of the above” because he knows so much rather than because he 
knows so little. The use of “a and c above” would pose even worse 
dilemmas for students. Although the Bill of Rights does not directly 
mention slavery, it has definite implications for human bondage. Conse- 
quently, the knowledgeable student must ask himself, “What does the 
teacher want?” 

The use of “none of the above” and other such alternatives changes
the basis for answering from one of seeking the most correct alternative
to one of seeking one or more alternatives that are absolutely correct.
This introduces all the difficulties found in true-false items. The use of
“none of the above” and other such alternatives is definitely not
recommended.

8. Avoid grammatical cues and sentence structures that give away
the correct alternative.


VIOLATION: 


A feather and a rock would fall at the same speed in a
a. atmosphere 
b. vacuum 
c. any fluid 
d. gases 


To get the correct answer, students need know nothing about physics.
It cannot be “atmosphere” or “any fluid,” because neither can follow
“a.” Since the sentence cannot (grammatically) end in “a gases,”
alternative d is ruled out.


Fortunately, such extreme ineptness is seldom found in teacher-made
items, but to a lesser extent grammatical cues often help students to
rule out some of the incorrect alternatives. This is most often the case
when some of the alternatives break no obvious rules of grammar but
would make awkward sentences.

9. Use negatives sparingly in problem statements. In some items
the only sensible course is to have students look for an alternative that
does not apply or does not follow from a principle. If done sparingly,
some use of the negative is understandable. The difficulty with using
negatives is that it is somewhat confusing to students, who are more
accustomed to seeking correct answers rather than incorrect answers.
With phrases such as “Which one of the following was not ...,” students
are prone to misread “was not” as “was” and thus mark the alternative
which appears most correct rather than the one which appears most
incorrect. Also, it is easier to think in terms of “degrees of correctness”
than in “degrees of incorrectness.”

Wherever possible it is better to use positive rather than negative
expressions. When negatives must be used, make sure that the negative
word or phrase is underlined or otherwise made clearly evident to
students. By all means avoid double negatives, that is, negatives in
both the problem and in some of the alternatives.

10. Ensure that item content relates to important aspects of the
subject matter. The final and most significant rule is to ensure that
something worthwhile is being measured. No matter how faithfully 
teachers apply the other rules of item construction, if their items con- 
cern only trivial facts, nothing can be done to save the test. 

In all units of instruction it is important to test for the memory and 
understanding of some names, dates, terms, and simple facts. However, 
it must be ensured that such simple facts are truly necessary as a 
foundation for deeper understanding. Students cannot understand the 
functions of the heart unless they remember “left ventricle,” “right 
ventricle,” etc. Students must remember the multiplication table or they 
will be forever troubled in more complex mathematical problems later. 
Students must remember simple rules of grammar before they can go 
on to higher forms of composition. In these instances it is proper to test
for the memory of simple facts. In many cases teachers devote an 
inordinate proportion of their test items to the memory of simple names 
and facts either because they actually value this type of knowledge or 
because they are not skillful enough to construct better tests. 

As has been said and demonstrated in a number of places so far in 
this book, objective test items can test many facets of understanding in 
addition to memory of details. Examples have been shown (and others
will be shown later) where objective test items can measure various
aspects of “reasoning with” including (a) implications of facts, (b)
relations between concepts, (c) practical applications of principles,
(d) deductions, and (e) critical reactions. Although it takes patience
and skill to measure “higher” forms of knowledge with objective test 
items, it can be done. 


Construction of Essay Items 


In essay items students supply their own answers in their own ways.
If well constructed, essay items can be used to measure the students’
ability to “reason with,” to organize their thoughts, and to express 
themselves in writing. To help ensure that these important aspects of 
learning are actually measured, the following rules should be heeded: 

1. Employ a relatively large number of short-answer items rather
than a relatively smaller number of long-answer items. Long-answer 
questions are prone to a number of faults. Students often get quite lost 
in their own answers, tend to repeat themselves, and are not sure when 
they have said enough. Teachers get lost in reading page after page of 
response to one question, forget what the student said earlier, and have 
difficulty in forming a reliable impression of the quality of the response. 
In long-answer questions, it is very difficult to accurately aim the 
student toward the types of responses which the teacher wants. 

Long-answer essay items often turn into a speed-of-writing contest, 
in which students try to write as much as possible, and teachers are 
often unduly influenced by the sheer length of the response. On long-answer
questions, the highly verbal student has a distinct advantage
even if he is not highly knowledgeable about the subject matter. 

Another disadvantage of the long-answer item is that only a small 
number of items can be used in the test because of the limited amount 
of time that is available for most classroom examinations. This usually 
results in a very poor sampling of the total subject matter to be covered 
by the test, and, consequently, luck (measurement error) plays a large 
part in the results.

Far better than to use a relatively small number of long-answer 
questions is to use a larger number of short-answer questions. By the
latter is meant a question that is answered in no more than one-half 
page (8 by 11 inches). If well constructed, the advantages of short- 
answer questions are (a) students are more easily “aimed” at the 
correct response, (b) speed of writing is not strongly influential, (c) 
teachers have more concrete standards for grading, and (d) the ques- 
tions can be ranged broadly across the subject matter. 

To ensure that students actually restrict themselves to a limited 
amount of space, teachers should inform students of the amount of 
page space that can be used for each question. One way to do this is
to inform students that questions 1 and 2 must be answered on the
first page, 3 and 4 on the second page, and so on until the end of the
test. Even better, if duplicating services and sufficient paper are availa- 
ble, the questions can be reproduced with lines drawn on the paper to 
indicate the areas within which answers must be given. 

2. Provide enough detail in the question to accurately aim students 
toward the desired response. One of the major faults of many essay
items is that they pose such global and/or ambiguous questions that
even an expert in the field could not provide the teacher with an A
answer. Rather than being a simple rule of test construction which any
teacher can easily heed, the accurate “aiming” of essay questions is a
major part of the art and skill of test construction. Because of the
importance of the problem, a number of examples will be given of
“violations” and “improved” versions:


VIOLATION:


What happened to art during the Renaissance?


The question is so vague and open-ended that almost any answer could
follow.


IMPROVED:


Describe the changes in art during the Renaissance, considering (a)
stylization, (b) portrayal of the human figure, (c) use of color, and
(d) sources of support for artists.


VIOLATION: 


Why was the Bill of Rights enacted? 


Here again the question is so vague that almost any answer could 
follow including “because it was a good idea.” 


IMPROVED: 


Describe the political and popular support given to the enactment of
the Bill of Rights. In your answer discuss (a) the types of “rights”
covered, (b) the relations of the Bill to similar provisions in state
constitutions, and (c) the part James Madison played in the enactment of
the Bill.


In the improved version, students have a much better idea of what the 
teacher wants. 


VIOLATION: 


What are Newton's laws of motion? 


The question obviously lacks the needed amount of specification to 
accurately aim students toward the correct response. 


IMPROVED: 


Describe each of Newton's three laws of motion. Illustrate each with 
the action of the ball in a game of baseball. 


The major reason why so many essay items fail to accurately aim 
students is that teachers do not supply the specific a, b, and c of what 
is meant. In constructing items it takes only a few more minutes to 
supply the detailed specifications.

Even with the detailed specification, in some types of questions it 
is difficult to accurately aim students. In such cases a number of tech- 
niques prove helpful. One of these is to give the student a “hint,” or 
example, of the type of response which is being sought: 


In a “right” triangle, the “side opposite” divided by the “hypotenuse”
forms a ratio called the sine of the angle. There are six basic ratios in 
trigonometry. State the names and formulas for each of the other five. 


3. Require all students to answer the same questions. Unless all
students answer the same questions, the test is not really standardized,
and it is very difficult to make accurate comparisons among students.

One way in which this rule is violated is to give students a choice of 
questions. For example, the teacher might give eight questions and 
let each student choose four. Some teachers argue for the use of this 
procedure, because, as they say, it allows a student to appear at his 
best. However, the purpose of a test is not to see how well the student
can do if he is allowed to pick his own questions, but rather to deter- 
mine how proficient he is when tackling a representative list of ques- 
tions. 

If students choose their own questions, it is difficult to make com- 
parisons among students. Conceivably two students could choose 
entirely different questions, and there would be no concrete basis for 
comparison. Also, to let students choose their own questions places too 
much of a burden on the student’s judgment of his own areas of com-
petence. The student may think that he knows more about the battles
of the Civil War than about the Reconstruction period and choose the 
question relating to the former topic, while, in fact, if he had written 
on the Reconstruction period, he might have received a better grade. 
Teachers have different standards for grading different questions. On 
some questions, teachers are satisfied if students know the bare bones 
of the issue. On other questions teachers expect students to have de- 
tailed knowledge. It is very difficult for students to accurately gauge 
these differences in standards, and when given a choice, they some- 
times choose questions on which they will do poorly. 

One reason why teachers sometimes give a choice of questions is 
that they think they are being “nice” to students. Students like to have 
a choice of questions. They like any procedure which appears to let 
them control the testing situation in such a way as to get a good grade. 
The student’s worst fear is that he will be “trapped” by a question, or
questions, which he does not understand. Consequently, he feels that
being given a choice offers an escape hatch.

If students know that they will be offered a choice of questions on
the test, there is less motivation to cover all the material. They are lax
in their coverage because they reason that, if permitted a choice, they
should be able to find some questions which they can answer. All these
things considered, it is definitely unwise to give students a choice of
questions.

4. Do not require too much writing for the time available. Teachers 
sometimes require much more writing than most students can complete
in the time available. This is done by either giving a few very-long-
answer questions or many short-answer questions. When too much
writing is required, the test turns into a speed-of-writing contest. If a
student is only average in his ability to write rapidly, he might make
only an average grade even if he is quite knowledgeable about the
topic.




To accurately judge the amount of writing which should be required 
of students, teachers should first specify the total page space which 
will be permitted. This amount of space should then be proportioned 
into the number of questions which will be used. A good rule to follow 
is to require no more than two full pages of writing in fifty minutes 
from sixth-grade students. A half page of writing can be added for 
each grade above the sixth. Eighth graders can usually complete three 
pages of essay answers in fifty minutes; tenth graders can complete 
four pages; and twelfth graders can usually complete five pages. 
Although the rule is only approximate, being too lenient for some 
students and some topics, and too strict for other students and other
topics, it offers a starting point for gauging the amount of writing that
can reasonably be required. 
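
The page-space rule above amounts to simple arithmetic: two pages at the sixth grade plus half a page per grade thereafter, divided among the questions. The sketch below is only an illustration under that reading. The function names are hypothetical, and the rule is applied only to grades six and above and to a fifty-minute period, since the text gives no guidance outside those limits.

```python
def max_essay_pages(grade):
    """Approximate pages of essay writing to require in fifty minutes.

    Two full pages for sixth graders, plus half a page for each grade
    above the sixth. The rule is stated only for grade six and above.
    """
    if grade < 6:
        raise ValueError("the rule of thumb is stated only for grade 6 and above")
    return 2.0 + 0.5 * (grade - 6)


def pages_per_question(grade, n_questions):
    """Proportion the total page space evenly among the questions."""
    return max_essay_pages(grade) / n_questions


if __name__ == "__main__":
    for grade in (6, 8, 10, 12):
        print(grade, max_essay_pages(grade))
    # Six questions for eighth graders: half a page each.
    print(pages_per_question(8, 6))
```

Dividing the space evenly is itself an assumption; a teacher may well give some questions more room than others, provided the total stays within the limit.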


Item Analysis 


So far in this chapter we have talked about procedures that usually 
will provide good tests. In applying these rules it is necessary for 
teachers to use their own judgment as to how well items actually will 
work. Although such judgment is the primary basis for all good 
teacher-made tests, some empirical procedures and results are helpful. 

It is usually the case that teachers have administered some of their 
items to previous classes. If so, some simple statistical analyses of those 
results will provide some clues regarding which items should be chosen 
for new tests. 

Easiness. On objective tests, one of the most useful statistics is the 
per cent of students who get each item correct, which is called the 
easiness percentage. This is found quite simply by tallying the number 
of students who get the item correct and dividing this by the total 
number of students taking the test. Sample results are as follows: 


Item number Easiness percentage 
1 55 
2 34 
3 92 
4 45 
5 12 
6 69 


Of course in such a study there would be many more than the six 
sample items shown above. The study shows that some of the items are 
easy: 92 per cent get item 3 correct and 69 per cent get item 6 correct. 
Other items are difficult: only 12 per cent get item 5 correct. 
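
For teachers who keep item results in machine-readable form, the tally just described is easily automated. The following is a minimal sketch; the function name `easiness_percentages` and the layout of the data (one list of right/wrong marks per student) are assumptions made for illustration.

```python
def easiness_percentages(responses):
    """Per cent of students answering each item correctly.

    `responses` holds one list per student; each entry is True (or 1)
    if that student got the corresponding item correct.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [
        100.0 * sum(student[i] for student in responses) / n_students
        for i in range(n_items)
    ]


if __name__ == "__main__":
    # Four hypothetical students answering three items.
    marks = [
        [True, True, False],
        [True, False, False],
        [True, True, False],
        [False, False, False],
    ]
    print(easiness_percentages(marks))  # [75.0, 50.0, 0.0]
```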

It is important to compute such easiness percentages because extremely
easy or extremely difficult items add little information. In a
statistical sense, the ideal item is one with a 50 per cent easiness rating, 
because such an item provides the maximum number of discrimina- 
tions. When an item is either very easy or very hard, it serves only to 
differentiate a relatively few students from the others. Item 3 above 
serves only to differentiate 92 per cent who pass the item from 8 per
cent who do not. 
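
One way to make “the maximum number of discriminations” concrete is to count the pairs of students that a single item tells apart: every student who passes is distinguished from every student who fails. That pair-counting reading is an assumption adopted here for illustration, and the function name is hypothetical.

```python
def discriminations(n_students, easiness_percent):
    """Number of student pairs separated by one item.

    Each passing student is distinguished from each failing student,
    so the count is (number passing) x (number failing).
    """
    passed = round(n_students * easiness_percent / 100)
    failed = n_students - passed
    return passed * failed


if __name__ == "__main__":
    # With 100 students: an item of 50 per cent easiness, then items 3 and 5.
    for pct in (50, 92, 12):
        print(pct, discriminations(100, pct))
```

With 100 students, an item of 50 per cent easiness separates 2,500 pairs, whereas item 3 (92 per cent) separates only 736 and item 5 (12 per cent) only 1,056.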

Teachers may wonder how a test composed entirely of 50 per cent 
easiness items could differentiate truly bright students and truly dull 
students from the average. The reason that such items can effectively 
differentiate at all points on the score continuum is that on each item 
a somewhat different 50 per cent would get the item correct. Only if 
items measured exactly the same thing would the same 50 per cent get 
all the items correct. Because the 50 per cent would be composed of 
somewhat different students on each item, there is ample room for one 
student to get all the items correct and for another student to get none 
of the items correct. A complete, and highly reliable, distribution of 
scores can be obtained by using items all of 50 per cent easiness.

Although teachers do not need to go to the extreme of seeking items
all of which are at or quite near 50 per cent easiness, it usually is well
to beware of items at the two extremes. They will not hurt the test,
but they will take up room that could be given to more differentiating
items. A good rule to follow is to use few items that are either above
80 per cent or below 20 per cent. Of course the objectives of a
particular unit of instruction and the nature of the subject matter should
have precedence over statistical considerations in the development of
test items. Some topics inherently are either rather easy or rather
difficult for students as a whole, but teachers will not want to entirely
avoid these topics. Some rather easy items will be included in the test
to give the below-average but deserving student a sense of
accomplishment, and some rather difficult items will be included to show students
that they still have much yet to learn about the topic. However, even
items intended to serve these purposes usually need not be beyond the
20 to 80 per cent zone of easiness.
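
The 20 to 80 per cent rule can be applied mechanically to a list of easiness percentages, as in the sketch below. The function name is hypothetical, and the sample figures are the six illustrative items tabulated earlier in this section. Flagged items are candidates for review, not automatic rejection, since subject matter takes precedence over statistics.

```python
def flag_extreme_items(easiness, low=20, high=80):
    """Return the items whose easiness percentage lies outside low..high."""
    return {
        item: pct
        for item, pct in easiness.items()
        if pct < low or pct > high
    }


if __name__ == "__main__":
    sample = {1: 55, 2: 34, 3: 92, 4: 45, 5: 12, 6: 69}
    print(flag_extreme_items(sample))  # {3: 92, 5: 12}
```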

The same type of analysis can be made of scores given to essay 
questions. For this purpose imagine that the teacher has assigned
numerical grades to the individual answers of 5 through 1 corresponding
to letter grades of A, B, C, D, and F. If the teacher has a collection
of grades given to questions used in previous classes, he can compute 
the average easiness of each essay question. This is done by adding all 
the grades for the question and dividing by the number of students 
answering the question. Similar to the way in which extreme items are 
weeded out of objective tests, extremely easy or extremely difficult
items can be weeded out of essay tests. Using the grading scheme
above, these would be items with average scores higher than 4.0 or
lower than 2.0. In the same way in which extreme items on objective 
tests provide little information about students, extremely easy or dif- 
ficult essay items fail to produce a sufficient “spread” of scores. 
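
The same computation for essay questions can be sketched as follows, using the 5-through-1 grading scheme and the 4.0 and 2.0 cutoffs given above. The names `GRADE_POINTS`, `essay_easiness`, and `is_extreme` are illustrative assumptions.

```python
# Numerical equivalents of letter grades, as described in the text.
GRADE_POINTS = {"A": 5, "B": 4, "C": 3, "D": 2, "F": 1}


def essay_easiness(letter_grades):
    """Average numerical grade earned on one essay question."""
    points = [GRADE_POINTS[grade] for grade in letter_grades]
    return sum(points) / len(points)


def is_extreme(average, low=2.0, high=4.0):
    """True if the question is too easy (above high) or too hard (below low)."""
    return average > high or average < low


if __name__ == "__main__":
    # Hypothetical grades given to one question in a previous class.
    grades = ["A", "A", "B", "A", "B"]
    avg = essay_easiness(grades)
    print(avg, is_extreme(avg))  # 4.6 True
```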

Distribution of Answers to Alternatives. On multiple-choice items 
it is useful to compute the per cent of students who mark each alterna- 
tive. (For the easiness score, only the per cent marking the correct 
alternative was required.) An example is as follows: 


Southern Africa leads the world in the mining of: 


Per cent 
2 a. coal 
16 b. bauxite 
54 c. diamonds 
28 d. iron ore 


The first thing to note about the distribution of preferences for alterna- 
tives above is that, statistically speaking, this is a “good” item in the 
sense that about half (54 per cent) of the students choose the correct 
answer. In addition to learning about the easiness percentage, the 
distribution of preferences provides two other important types of 
information. First, it can be seen whether or not some of the incorrect 
alternatives are give-aways. If so, they should be discarded and re- 
placed by more plausible alternatives in future uses of the item. Such 
is the case with alternative a above which is marked by only 2 per cent 
of the students. 

It is difficult to give a complete rule for the weeding out of give-away 
alternatives, because the standard would fluctuate with the number of 
alternative answers being used. One useful standard regardless of the 
number of alternatives is to replace those that do not receive at least 
5 per cent of the marks. Another standard, which is more complex to 
obtain but which takes into account the number of alternatives, is 
found by subtracting the per cent of persons who get the item correct 
from 100 and dividing the result by twice the number of incorrect 
alternatives. In the example above 54 per cent get the answer correct, 
leaving 46 per cent who miss the correct answer. Dividing 46 by 6 
(there being three incorrect alternatives) gives a figure of approxi-
mately 8 per cent. This figure equals one-half the average per cent
given to the three incorrect alternatives. Another way of saying it is 
that if the alternative does not reach the specified level, it is less than 
half as distracting as the average of the incorrect alternatives. If an 
incorrect alternative does not reach this level, it is not serving sufficiently
as a distractor and should be replaced by a more plausible
alternative. In the example above, only alternative a (“coal”) would 
be replaced using the rule. 
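
The second standard can be written as a small calculation: the cutoff is (100 minus the per cent correct) divided by twice the number of incorrect alternatives. The sketch below applies it to the mining item; the function names and the dictionary layout are assumptions for illustration.

```python
def distractor_threshold(percent_correct, n_incorrect):
    """Half the average per cent drawn by the incorrect alternatives."""
    return (100 - percent_correct) / (2 * n_incorrect)


def weak_distractors(percents, correct):
    """Incorrect alternatives marked by fewer students than the threshold."""
    threshold = distractor_threshold(percents[correct], len(percents) - 1)
    return [
        alternative
        for alternative, pct in percents.items()
        if alternative != correct and pct < threshold
    ]


if __name__ == "__main__":
    mining_item = {"a": 2, "b": 16, "c": 54, "d": 28}
    print(round(distractor_threshold(54, 3), 2))  # 7.67
    print(weak_distractors(mining_item, "c"))     # ['a']
```

The exact cutoff for this item is about 7.7 per cent, which the text rounds to 8; either figure singles out only alternative a.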

The second important type of information provided by the distribu- 
tion of percentages is the likelihood that some items are ambiguous. 
Following is an example: 


Southern Africa leads the world in the mining of: 


Per cent
a. cobalt
b. bauxite
c. diamonds
d. iron ore


Because “coal” proved to be too poor a distractor, the teacher replaced
it with “cobalt,” which in subsequent use of the item proved to be an
even more popular response than the intended correct alternative
(“diamonds”). Whenever more students mark one of the intended
incorrect alternatives than mark the intended correct alternative, the
teacher should carefully inspect the item for sources of ambiguity.
When this happens, students often have a good reason for their
“incorrect” choice. In the example above, the 42 per cent who marked
“cobalt” might say that they were including the Congo as part of
Southern Africa, and unless the text or classroom instruction has
specifically ruled out this identification, then “cobalt” is as good an
answer as “diamonds.”

Although some would defend the use of highly popular incorrect
alternatives as providing good misleads, it is usually the case that such
items really entail some ambiguity which is serving to “trap” even very
knowledgeable students. In general it is unwise to retain incorrect
alternatives that are more popular than the designated correct
alternative.


Discrimination. Another important method of analyzing test results
is to determine the extent to which each item “goes along with,” or
measures the same thing as, the total test in which it is included. There
are several ways to do this, one of the simplest of which is
described as follows. First, find the top 25 per cent of students in terms
of total test scores. Next find the bottom 25 per cent. If scores are
available on 100 students, this would represent the top 25 and the
bottom 25 in terms of total test scores. Next, for each item determine
the per cents of students in the top and the bottom groups who get the
item correct. Finally, subtract the percentage for the bottom group
from the percentage for the top group. The resulting measure is a very
important index of the extent to which each item contributes to the 
total test. Some sample results are shown in the following table: 


        Per cent of     Per cent of    Difference
Item    bottom group    top group      (discrimination index)
 1          24              72              48
 2          48              32             -16
 3          54              56               2
 4          16              39              23
 5          18              84              66
 6          38              65              27


The important measure above is the difference in percentages shown 
in the right-hand column. Other things being equal, the larger the dif- 
ference, the better the item. It should be obvious that if on a particular 
item the bottom group does as well, or almost as well, as the top group, 
the item is doing little or nothing to distinguish the bright from the 
dull. Such percentages of difference are almost always positive, an 
exception being item 2 above where 48 per cent of the bottom group 
gets the item correct and only 32 per cent of the top group gets the 
item correct. When the difference is negative, it usually means that the 
teacher has not chosen the correct alternative or that the item is am-
biguous, particularly for students who know the most.

If the difference is small, the item is adding little to the total test. 
In general, it is wise to be suspicious of items for which the difference
is not at least 20 percentage points. Only items 2 and 3 above fail to
achieve that standard. It is exceptional to find items that discriminate as
effectively as item 5 above (66 percentage difference). 
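
The top-and-bottom-quarter procedure can be sketched in a few lines. This is only an illustration: the function names are hypothetical, each student's record is assumed to be a list of 0/1 item scores, and students tied at the boundary of a group are left in whatever order they were entered, a detail the text does not address.

```python
def discrimination_indices(scores, fraction=0.25):
    """Top-group per cent correct minus bottom-group per cent, per item.

    `scores` holds one list of 0/1 item scores per student.
    """
    ranked = sorted(scores, key=sum, reverse=True)  # best total score first
    k = max(1, int(len(ranked) * fraction))         # size of each group
    top, bottom = ranked[:k], ranked[-k:]
    indices = []
    for i in range(len(scores[0])):
        top_pct = 100.0 * sum(student[i] for student in top) / k
        bottom_pct = 100.0 * sum(student[i] for student in bottom) / k
        indices.append(top_pct - bottom_pct)
    return indices


def suspicious_items(differences, minimum=20):
    """Items whose discrimination index falls short of the minimum."""
    return [item for item, diff in differences.items() if diff < minimum]


if __name__ == "__main__":
    # Differences taken from the table above.
    table = {1: 48, 2: -16, 3: 2, 4: 23, 5: 66, 6: 27}
    print(suspicious_items(table))  # [2, 3]
```

Applied to the differences in the table, the 20-point standard singles out items 2 and 3, as noted in the text.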

There are several reasons why an item may fail to discriminate top 
from bottom students, among which are (a) ambiguity of wording, 
(b) extreme easiness percentage, and (c) incorrect designation of 
“right” answer by the teacher. If any of these are the case, the teacher 
would want either to delete or revise the item for subsequent use. 
However, there is a fourth possibility, one which is not easy to detect: 
the item may pertain to an important, but rather isolated, aspect of the 
subject matter. This might be the case if the item related to a specific
assignment which was not covered in the lectures or text. Students who
read the assignment would tend to get the item correct; students who 
did not read the assignment would tend to get the item incorrect. The 
tendency to get the item correct might then be relatively unrelated to 
general knowledge of the subject matter. This is particularly likely to 
happen when the teacher has not clearly indicated that the special
assignment would be covered in the test. If this is the case, the teacher,
after inspecting the item, might decide that there is nothing wrong 
with it and that it should be retained on future tests. Instead of 
deleting the item, the teacher would make the assignment clearer to 
students. 

Although there are special cases, such as the one described above, in
which teachers may rightly retain nondiscriminating items, such items
should usually be held suspect. It is very comforting to know that an
item does serve to discriminate top from bottom students. If an item
does not discriminate, the teacher should take a long, hard look at it
before deciding to retain it in future tests.

Teachers will usually not have the time to compute indices of
discrimination for most of their tests. Such computational labors would
probably be reserved for very important tests which are used with
many students. It would be worth the trouble to compute indices of
discrimination for an important final examination which is routinely
used with several sections and on which many of the same items will
be used from year to year.

Indices of discrimination (either the index described above or ones
that provide similar results) are routinely used in the construction of
commercially distributed tests. Because on such tests each item is
carefully selected, it is important to determine to what extent it
discriminates top from bottom students.

Cautions in Item Analysis. One of the most important cautions is
to pay little attention to statistical results unless they are based on
relatively large numbers of students. All the indices discussed in this
section are quite erratic when based on only a small number of
students. A good rule to follow is to take such statistical results seriously
only if they are based on at least forty, and preferably one hundred,
students. If no more than ten or twenty students are involved, the
results depend greatly on chance. With that small a group of students
very different results would probably be obtained if the items were
administered to another group of students. The forty rule means that
there should be at least forty students tested to obtain the easiness
percentage and the distribution of percentages for alternatives; for the
index of discrimination, there should be at least twenty students in
the top and twenty students in the bottom group, requiring that at
least eighty students be tested in all.

A second important caution is that statistical results are useful only
if the instruction and the text remain much the same. Obviously, if an
item pertains to sections which were covered in an old text but not
in a new one, the item will not serve as well in future tests as it has in
the past. Also, if classroom instruction or outside assignments have
been markedly changed, the statistical results obtained from previous
students will supply misleading information about how items are likely 
to work in future tests. 

A third and final caution is that teachers should realize the important 
but limited extent to which statistical procedures alone can guide them 
in constructing good tests. Statistical results provide many hints and 
clues, particularly regarding possible faults in items. However, unless 
the original collection of items was carefully constructed in such a way 
as to broadly sample the important content, no amount of statistical 
analysis can turn a poorly conceived, badly written collection of items 
into a good test. Statistical procedures are most helpful in adding a 
final polish to a basically good group of items. 
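
The sample-size rule in the first caution above can be written down as a simple check. The following is only a minimal sketch: the function name and the returned labels are illustrative assumptions, and the thresholds are the ones stated in the text (forty students for the easiness percentage and the percentages for alternatives, twenty in each of the top and bottom groups for the index of discrimination).

```python
MIN_STUDENTS_TESTED = 40   # "at least forty" students, as stated in the text
MIN_PER_GROUP = 20         # "at least twenty" in the top and in the bottom group


def can_trust_item_statistics(n_tested, n_top, n_bottom):
    """Report which item statistics rest on enough students to be taken seriously.

    The function name and dictionary keys are hypothetical; only the
    thresholds come from the text.
    """
    return {
        "easiness": n_tested >= MIN_STUDENTS_TESTED,
        "discrimination": n_top >= MIN_PER_GROUP and n_bottom >= MIN_PER_GROUP,
    }


# A single class of 30 students fails both checks.
small_class = can_trust_item_statistics(n_tested=30, n_top=8, n_bottom=8)

# Several sections pooled to 80 students, with 20 in each extreme group, pass both.
pooled_sections = can_trust_item_statistics(n_tested=80, n_top=20, n_bottom=20)
```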


Summary


Items are the building blocks for tests. Unless the individual items
are good, the total test cannot possibly be good. Writing good test
items is an art which many teachers seem never to learn. Their major
faults are (a) they fail to include important content in test items, and
(b) they use poor journalistic style in composing items. For the former,
it is important for teachers to carefully think out what they consider
to be the important content, to outline the content, and to translate the
outline into a good test. For the latter, teachers need to obey some of
the rules for writing items which were discussed in the chapter and to
gradually sharpen their journalistic skill through much practice in
composing tests. However, learning rules for writing items and practicing
writing items are not sufficient to achieve a high level of skill in test
construction. In addition it is necessary to gather information from
students and fellow teachers about the adequacy of one’s test items
and sometimes to perform statistical analyses of the items within tests.

There are many different types of test items, essay and objective,
and numerous kinds of each. It was emphasized that most of these
types of items can be translated into excellent tests if they are skillfully
composed. Skillful writing of items is far more important than the
particular type of item which is used. It was recommended that a good
combination of materials to use over a school term for evaluation is
(a) some short-answer essay questions, (b) some multiple-choice
questions, and (c) themes, projects, and class reports. Taken together, these
should provide an excellent basis for evaluating the progress of
students.


Suggested Additional Readings 


Ebel, R. L. Writing the test item. In E. F. Lindquist (Ed.), Educational
measurement. Washington: American Council on Education, 1951, chap. 7.

Gerberich, J. Specimen objective test items. New York: Longmans, 1956.

Ross, C. C. and Stanley, J. C. Measurement in today’s schools. (3rd ed.)
Englewood Cliffs, N.J.: Prentice-Hall, 1954, chaps. 6, 7.

Stalnaker, J. M. The essay type of examination. In E. F. Lindquist (Ed.),
Educational measurement. Washington: American Council on Education, 1951,
chap. 13.

Thorndike, R. L. and Hagen, Elizabeth. Measurement and evaluation in
psychology and education. (2nd ed.) New York: Wiley, 1961, chaps. 3, 4.

Vaughn, K. W. Planning the objective test. In E. F. Lindquist (Ed.),
Educational measurement. Washington: American Council on Education, 1951,
chap. 6.


chapter 


Scoring, Grading, 
and Reporting 


In Chapter 5 we left Max Marshall pondering what type of test to give
to his students in general science. Belatedly he followed our advice
and composed an outline of the important content to be covered in the
test. Then, while paying heed to the rules for writing good test items
described in Chapter 6, he composed a test consisting of twenty
multiple-choice items and four half-page essay questions. The testing
session went well. None of the students seemed overjoyed with the
test, but, on the other hand, no students complained bitterly about
“confusing questions,” “material not in the book,” or “not enough time.”

Now Max has the papers back in his office. What does he do now?
First, he must turn the results into numerical form in such a way as to
show which students performed “better” and which performed “worse.”
Second, he must evaluate the results, give meaning to them with
respect to his own standards, the standards of the school, and the
standards of outside educational agencies. Third, if the scores are to 
supply needed information to students, parents, the school, and others, 
they must be reported in a manner that accurately communicates how 
well students performed. The purpose of this chapter is to discuss 
these three important steps required to transform raw test results into 
meaningful units of discourse. 


Scoring Objective Tests 


Scoring objective tests is purely a mechanical problem which
requires no special skill. If it is truly an objective test, a scoring key is
available stating the correct answer for each item. Scoring consists of
checking the student’s response to each item to see if it is correct.

The simplest, and usually the most sensible, way to obtain a total 
score for each student is to count up the number of correct answers. 
This would be, respectively, the number of true-false or multiple-choice
items marked correctly, the number of correct matchings in matching 
items, and the number of correct terms supplied in fill-in items. 

Weighting. Some teachers would argue that it is not wise to obtain
total scores by simply counting the number of correct responses. Their
reasoning is that some of the items are more important than others and,
consequently, should be counted more (weighted more) than others.
Is there much to be gained from “weighting” items, such as to count
correct responses to some items 1, and others 2, and still others 3? For
two reasons, very little usually is gained by weighting items. The first
reason is that it is very difficult to decide which items should receive
which weights. Two teachers asked to make such weightings might
disagree markedly. When there is no very sensible basis for weighting,
the most sensible rule is to weight all items the same, which results in
simply counting the number of correct responses throughout the test.

The second reason why little is to be gained from weighting is that
it makes only minute changes in relative standings of students. A
student who scored near the top of the class on an unweighted scoring
would remain near the top of his class on a weighted scoring of the
same test. The relative influence of weighting is affected by the length
of the test. If there were only five items on the test, “weights” might
have a relatively marked influence on the comparative standings of
students. With as many as ten items, weighting would have some
influence, but not much. If there were as many as twenty items on the
test (and there almost always are), weighting would have practically
no influence on relative standings. For these reasons it usually is wise
to obtain test scores by simply counting the number of correct answers.
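
The two ways of totaling an objective test discussed above can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure: the scoring key, the weights of 1, 2, and 3, and the three students' answers are all made up for the example, and the function name is hypothetical.

```python
def total_score(responses, key, weights=None):
    """Sum the weights of the items answered correctly.

    With no weights supplied, every item counts 1, so the result is
    simply the number of correct answers.
    """
    if weights is None:
        weights = [1] * len(key)
    return sum(
        weight
        for answer, correct, weight in zip(responses, key, weights)
        if answer == correct
    )


# Hypothetical five-item multiple-choice test.
key = ["b", "d", "a", "c", "e"]
weights = [1, 2, 3, 1, 2]          # illustrative weights chosen by one teacher

students = {
    "Ann": ["b", "d", "a", "c", "a"],
    "Bob": ["b", "a", "a", "c", "e"],
    "Cal": ["a", "d", "b", "c", "e"],
}

unweighted = {name: total_score(r, key) for name, r in students.items()}
weighted = {name: total_score(r, key, weights) for name, r in students.items()}
```

In this made-up case the weighted and unweighted totals place the three students in the same order; with so short a test a different choice of weights could separate them, which is the point the text makes about test length.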

Special Answer Sheets. It greatly facilitates the scoring of objective
tests if students are provided with special answer sheets on which to
make their responses. It usually is found that students in the fourth and
higher grades have little difficulty in using answer sheets. Either
answer sheets can be purchased from commercial firms, or, with only
a small amount of effort, they can be typed and reproduced in the
school. A sample is as follows:


Item    Correct answer        Item    Correct answer

  1     a b c d e              21     a b c d e
  2     a b c d e              22     a b c d e
  3     a b c d e              23     a b c d e
  4     a b c d e              24     a b c d e
  .     . . .                   .     . . .
 20     a b c d e              40     a b c d e

Students are told to mark out (with ordinary pencil) the letter
corresponding to the correct alternative for each item. By using both the
front and back sides of each page, more than one hundred answers can
be obtained. In addition, spaces can be provided for name, section,
age, and other pertinent information.

When special answer sheets are used, tests can be quickly scored 
with the aid of a stencil. A stencil is made by cutting out the letter 
corresponding to the correct alternative for each item. This is most 
easily done by using one of the paper punches which are especially 
constructed for the purpose. 

After the stencil is placed over an answer sheet, each incorrect 
response is marked (through the hole in the stencil) with a colored 
pencil. The student’s total score is obtained by counting up the num- 
ber of colored pencil marks and subtracting that from the total number 
of test items. The colored pencil marks will serve to indicate to students 
both the items which they missed and the correct answers to those 
items. Of course it is always wise to do the scoring twice for all 
students to prevent mechanical errors. 

Corrections for Guessing. As was said previously, and will be re- 
affirmed here, corrections for guessing should not be made. Such 
corrections apply only if students are allowed to mark as many items 
as they choose. Far better is to require all students to mark every item,
even if they feel that they are only making wild guesses. When all 
students mark all items, the correction-for-guessing formulas do not 
change the relative standings of pupils. The top student before the 
correction was made would remain at the top afterward, the bottom 
student would stay on the bottom, and all students would retain their 
rank-order positions. 
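
The claim that rank order is unchanged can be checked with a small sketch. The text does not state a particular formula, so the one used here is an assumption: the conventional correction, number right minus number wrong divided by one less than the number of alternatives. The student labels and raw scores are made up. When every student marks every item, the number wrong is just the number of items minus the number right, so the corrected score always rises with the raw score.

```python
def corrected_score(n_right, n_wrong, n_choices):
    """Conventional correction for guessing (assumed, not given in the text):
    right minus wrong / (choices - 1)."""
    return n_right - n_wrong / (n_choices - 1)


n_items = 40      # hypothetical forty-item test
n_choices = 5     # five alternatives per item

# Made-up numbers of correct answers; every student marked all forty items.
raw_scores = {"student_1": 34, "student_2": 25, "student_3": 12}

corrected = {
    name: corrected_score(right, n_items - right, n_choices)
    for name, right in raw_scores.items()
}

# Rank students from highest to lowest under each scoring.
rank_raw = sorted(raw_scores, key=raw_scores.get, reverse=True)
rank_corrected = sorted(corrected, key=corrected.get, reverse=True)
```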

If students are allowed to mark different numbers of items, sheer 
guessing will penalize some students. Teachers have tried to rectify 
this inequity by applying numerous so-called “correction formulas.”
Most of these are quite illogical, e.g., number right minus number 
wrong. Even the most sophisticated statistical formulas have proved 
to work poorly in practice. Far better is to require all students to mark 
all items. Then each student at least has an equal opportunity to be 
lucky, and no correction formulas are needed. 

Some teachers object to having students attempt every item, because, 
as they say, it encourages guessing and sloppy thinking in all
schoolwork. Even though it is doubtful that this simple practice has such a
profound effect on students, good counter arguments can be given. It 
can be argued that if students attempt every item, they at least are 
actively trying and not giving up in defeat. Even “guessing” in this way 
forces the student to think about the item and to try to make a good 
guess. This often motivates the student to look up the matter in the 
text, which he might not have done if he had given up at first glance 
and not marked the item at all. Also, even when the student guesses 
incorrectly, it will mean more to him (than if he had omitted the item) 
when he sees the red mark on the returned test answer sheet. 




In large measure learning consists in trying, finding out that you are 
wrong, studying or thinking more, and trying again. To start the
learning process, even a wild guess is psychologically better than 
no attempt at all. For these reasons, even if we were not forced into 
the practice because of the inequities that would otherwise result, it 
would probably be better pedagogical practice to require all students 
to mark all items. 


Scoring Essay Items 


Compared with objective items, it is much more difficult and time
consuming to turn essay responses into numerical form. Also many
special pitfalls must be avoided which do not arise when scoring
objective items. Some of the most important rules for scoring essay
tests are the following:

Before scoring, outline the major points to be considered in each
question. Both objective and essay questions should reflect the
objectives of a unit of instruction. In framing questions, and in scoring
the responses, teachers should ask themselves questions such as (a) What
are the important facts and concepts covered during the last two
months? (b) What skills should have been developed? (c) What
implications should students be able to see for events in everyday life?
When a teacher composes an essay question, he should have in mind
an ideal answer. If he has the time, it is helpful to write out the ideal
answer as a basis of comparison for students’ answers. At least he
should write down (if only in his mind) some of the major points
which he wishes to see included. When scoring responses, however, he
should not go directly by his preconceptions of what constitutes a good
answer. First, he should read through all, or at least some, of the
students’ answers. Students often make excellent points which the
teacher had not considered. Also, because of some ambiguity in the
questions, students often take a different slant than had been intended.
If, in looking back at the question, the different direction is defensible,
the teacher should be flexible enough to score on the basis of how well
either direction was followed in the answers.

Reading through all, or some, of the answers as a prelude to scoring
a particular question has another advantage: it will help the teacher to
formulate a realistic base line. It is often the case that, before reading
some of the responses, the teacher’s expectations are far too high. After
seeing that a question is much more difficult than he had thought, he
is likely to be more lenient.

Develop a numerical scale for use with all questions. One method
of scoring each essay test is to affix a letter grade to each question.
However, this has several disadvantages. The principal disadvantage
is that, after the letter grades are given to the separate questions, there
is no easy way to add up the separate grades. For example, what total
score should a student receive who, on four questions, received letter
grades of A, B, and two D’s? Alphabetical letters cannot be added and
averaged, and, consequently, it is necessary to use some type of
numerical scale.

A second disadvantage of affixing letter grades to each item is that, 
even if they lead to an over-all test grade (usually with much diffi- 
culty), they serve only to place students in a few broad categories. 
The most popular system of letter grades allots an A for excellent work, 
F (some systems use E) for failure, and B, C, and D for various 
intermediate levels of performance. If this system of grading is applied 
to each item and (by some type of legerdemain) the grades for items 
are transformed to over-all letter grades, the result is to place students 
in only five, nonquantitative categories. The result is too crude to be 
used with many of the statistical computations that are often necessary, 
such as those required to obtain percentiles, standard deviations, 
correlations, and item discrimination indices. 

Although teachers may rightly want to place a letter grade on the 
total test result, in addition it is very useful to have a relatively precise 
numerical score. Besides being useful for computing the types of 
statistics mentioned above, numerical scores will facilitate the combin- 
ing of several tests to form an over-all score for the term. 

It is not extremely important how many points the teacher adopts
for his numerical scale, whether it has four or seven steps. The
important considerations are for teachers to define the meanings of the
points and develop a scale with which they feel comfortable in
continued use. Most teachers find that fewer than five points (scoring
1, 2, 3, or 4 for each question) does not provide fine enough distinctions.
At the other extreme, most teachers get lost in using more than seven or
eight points. The most widely used practice is to employ a five-point
scale, with 5 corresponding to A, 4 to B, and so on to 1 for F. A more
complete specification of the meaning of the scoring system is
shown in the accompanying table. The numerical scale (either the one
above or one that the teacher develops on his own) is applied to
each answer. Total scores on the test are obtained by adding up the
numerical scores given to the separate questions. Although the final
numerical total test scores may be far from perfect, they usually are
good enough for the types of analyses that must be performed upon
them.

Score   Letter grade   Meaning

  5          A         Excellent answer. Student supplied nearly all the
                       pertinent information and showed very good
                       judgment and understanding.

  4          B         Basically good answer, but either lacks some
                       important information or does not exhibit excellent
                       judgment and understanding.

  3          C         A very ordinary answer, “nothing to brag about.”
                       Supplies only part of the required information and
                       exhibits little real understanding of the problem.

  2          D         Poor answer, “a shame the student doesn’t know
                       more.” Only one or two ideas related to the problem
                       and those not understood.

  1          F         Clear failure. Nothing, or almost nothing, related to
                       the question and/or completely confused about ideas
                       presented. “Either he doesn’t study at all or, if so,
                       he should never have been promoted last year.”

Weight each item in relation to the page-length limit for answers.
Whereas it was recommended not to employ differential weights for
the items on objective tests, it usually is wise to employ differential
weights for the questions on essay tests. Although there is no foolproof
way for determining such weights, the most sensible procedure is to
weight them, at least approximately, in terms of the amount of page
space allotted for each answer. When a teacher allows one-half page
for one question and a whole page for another, it is reasonable to
assume that he considers one question to be twice as important as the
other. Consequently, they should be given different weights in
determining total scores.

Weighting by page length is simple to accomplish. For example,
suppose that the test contains 2 one-page questions and 6 half-page
questions. First the numerical scale is employed with each answer.
Because it is assumed that the responses on the one-page questions
should be weighted twice as much as those on the half-page questions,
each numerical score on the former is multiplied by 2. The numerical
scores on the half-page answers are left as they are. The two weighted
and the six unweighted scores are then summed. The sum is divided
by the total number of units, which equals the number of unweighted
scores, plus 2 times the number of answers weighted 2, plus 3 times
the number of answers weighted 3, and so on. In the example above
the sum of the weighted and unweighted numerical scores would be
divided by 10. Dividing in this way places the total score
in the same scale units as that used for each separate question.

If teachers do not place page-space restrictions on answers (they
should) and/or if the page-space allotment is approximately the same
for all questions, then there is no need to apply differential weights. In
that case, total scores should be obtained by summing all the scores
given to separate questions and dividing by the number of questions.
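
The arithmetic of weighting by page length can be sketched as follows. The test layout (2 one-page questions weighted 2 and 6 half-page questions weighted 1) is the example from the text; the eight numerical scores are made up, and the function name is hypothetical.

```python
def essay_total(scores, weights):
    """Weighted average of per-question scores.

    Each score is multiplied by its weight, the products are summed, and
    the sum is divided by the total number of units (the sum of the
    weights), which keeps the total on the same scale as each question.
    """
    if len(scores) != len(weights):
        raise ValueError("one weight is needed for every score")
    weighted_sum = sum(score * weight for score, weight in zip(scores, weights))
    total_units = sum(weights)
    return weighted_sum / total_units


# Two one-page questions (weight 2) followed by six half-page questions (weight 1).
weights = [2, 2, 1, 1, 1, 1, 1, 1]

# Made-up scores for one student on the five-point scale.
scores = [4, 5, 3, 4, 2, 5, 3, 4]

total = essay_total(scores, weights)   # (8 + 10 + 21) / 10

# With equal page allotments every weight is 1 and the result is the plain average.
plain_average = essay_total([5, 3], [1, 1])
```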

Keep test responses as anonymous as possible. One of the largest 
pitfalls in scoring essay tests is that teachers usually are considerably
influenced by the knowledge of the student’s name. No matter how fair 
and objective we try to be, we make marked differentiations among 
students. We are convinced that some are very well informed and 
others very uninformed, and often we are quite wrong. One purpose 
of tests is to take some of the guesswork out of evaluation, and this 
cannot be done if we mix our evaluation of the test responses with 
our over-all impression of the student. If we think the student is very 
bright, we are prone to excuse his poor answers on the grounds that 
“he really understands the subject.” Also, we are likely to let our 
impression sway us when grading the paper of a student we do not 
esteem. In that case we are prone to get hypercritical, seeing faults in 
answers that we would somehow overlook with our good students. 

If test responses are not kept anonymous, another source of bias 
comes from our personal likes and dislikes for students. Any teacher 
who says that he uniformly likes all his students is either a saint or a 
liar. Most of us get secret urges to flunk the noisy scoundrel in the third 
row and give an A to the darling little girl who brings presents and 
stays after school to help clean the debris. Sometimes we show reverse 
prejudice by giving good grades to a student purely because we are 
so on guard not to show prejudice. One of the good things about tests 
is that they protect us from our own biases. This is easily achieved 
with objective tests. With essay tests it can only be achieved if re- 
sponses are scored in ignorance of which student wrote which answer. 

One of the best ways to keep responses anonymous is to start by 
writing numbers on cards, or slips of paper, using as many cards as 
there are students. The cards are then shuffled and passed out to 
students. Students are instructed to write their names on the cards. 
Then students are told that they are not to place their names anywhere 
on the test responses but to write their number on the responses. At the 
end of the testing period students can drop their cards in an envelope. 
With this method, the teacher can score papers without knowing (for 
sure) which student gave which answer; and after the scoring is com- 
pleted, the envelope can be opened to see who made what over-all 
score.
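
The numbered-card procedure above amounts to keeping a sealed lookup table from numbers to names. The sketch below is only an illustration of that idea: the function name, the student names, and the scores are made up, and a seeded random generator stands in for shuffling the cards.

```python
import random


def assign_anonymous_numbers(student_names, rng=None):
    """Shuffle the numbers 1..N and hand one to each student.

    The returned dictionary (number -> name) plays the part of the
    envelope of cards: it is not consulted until scoring is finished.
    """
    rng = rng or random.Random()
    numbers = list(range(1, len(student_names) + 1))
    rng.shuffle(numbers)
    return dict(zip(numbers, student_names))


names = ["Billy", "Jack", "Linda"]                      # illustrative names
envelope = assign_anonymous_numbers(names, random.Random(0))

# While scoring, the teacher sees only the numbers written on the papers.
scores_by_number = {1: 3.5, 2: 4.0, 3: 2.5}             # made-up scores

# After all scoring is completed, the envelope is opened.
scores_by_name = {envelope[number]: score
                  for number, score in scores_by_number.items()}
```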

The only “hole” in the above procedure is that teachers often in- 
advertently recognize the author of an exam by his characteristic style 
of writing or by some other clue. Nothing can be done to prevent this 
happening occasionally. However, we are seldom perfectly sure of the 
student. We may say to ourselves, “This writing is so bad it must have 
come from Billy, but it may be Jack or Linda.” This introduces a suf- 
ficient amount of uncertainty to keep our biases from exerting their 
full force. 

Although the “cloak and dagger” atmosphere can be carried too far, 
any effort at anonymity is better than none. Even if students place
their names on their papers, some anonymity can be obtained. After 
grading the first question, with the name clearly visible, the responses 
can be folded over to hide the name. Then, when coming back to 
grade other questions, the teacher can hide the names from himself. 
Any effort at keeping names hidden will cut down on the influence of 
personal bias, and also, it generally will make students feel that they 
are being treated fairly. 

Score all the answers to one question before going on to the next.
It is definitely unwise to score in sequence all the test responses for
one student and then move on to score all the answers for the next student.
One thing wrong with this approach is that the scoring of a previous
answer will influence the scoring of subsequent answers. After scoring
the first answer, we might become peeved because the student made
such a poor showing. If we then go immediately to his second answer,
we are prone to be hypercritical. Scoring in sequence destroys the
independence of grading. If questions are not scored independently,
the reliability will be lowered.

Another fault with grading in sequence for each student’s responses
is that it makes it difficult to keep standards in mind. The teacher may
forget some of the points which he wanted covered.

Teachers usually shift their standards somewhat when reading 
through many answers. For a while they will be tough and then later 
they will become more lenient in their scoring. If all a student’s 
answers are graded in sequence, the student might be unlucky enough 
to catch the teacher in a tough mood. Consequently, on all his answers 
the student is treated severely. A student whose paper is scored later 
might be lucky enough to find the teacher in a more lenient mood and 
would receive good scores on all his answers. 

By far the best approach is to score all the responses to question 1, 
then score all the responses to question 2, and so on to the end of the 
test. Before starting to score all the responses to one question, it is wise 
to shuffle the papers. This will prevent students from always appearing 
near the end or near the start of the order of scoring. 
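
The recommended order of work can be laid out as a simple schedule. This is a minimal sketch under stated assumptions: the function name and the paper numbers are hypothetical, and a seeded random generator stands in for shuffling the stack of papers by hand.

```python
import random


def scoring_order(paper_ids, n_questions, rng=None):
    """List (question, paper) pairs in the order they should be scored.

    All papers are scored on question 1 before any paper is scored on
    question 2, and the papers are reshuffled before each question.
    """
    rng = rng or random.Random()
    order = []
    for question in range(1, n_questions + 1):
        papers = list(paper_ids)
        rng.shuffle(papers)
        order.extend((question, paper) for paper in papers)
    return order


# Three anonymous papers (numbered 11, 12, 13) and a two-question test.
order = scoring_order([11, 12, 13], n_questions=2, rng=random.Random(1))
questions_in_order = [question for question, _ in order]
```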

Score the papers over a period of time rather than all at once. It is 
usually unwise to score all the papers in one long sitting if for no other 
reason than that the scorer becomes too tired and too confused in the 
mass of words to do an adequate job. Even a fifteen-minute break will
allow the head to clear and give pause to reflect on the standards
which are being applied. After scoring papers for several hours, it is
easy to become grouchy and to take the meanness out on students.
When the papers are set aside for a while, the teacher might say to
himself, “What should I expect—they are only seventh-grade kids, and 
I am scoring like they were college students.”

If possible, have two independent scorings made of the test. The
reliability of the test will be raised considerably if the teacher will score 
the papers twice, preferably with several days intervening. If this is 
done, scores for students on the first occasion should not be recorded 
on their papers but should be written on a separate sheet. The sight of 
the scores given on the first occasion will markedly influence the second 
scoring, which is another example of the failure to maintain independ- 
ence of scoring. 

When teachers perform two independent scorings of the same 
papers, they usually are surprised at the differences in scores on the 
two occasions. Seldom does a 1 score change to a 6 or even a 2 to a 5, 
but many changes of one and two points are noted. The average of the 
two sets of scores is usually much better than either of the two sepa- 
rately. If teachers had the time (they usually do not), it would greatly 
increase the reliability of essay tests to perform two scorings. How- 
ever, teachers sometimes find the time to make two gradings of 
very important tests, such as an important final examination in high 
school. 
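
Averaging independent scorings, as described above, is a paper-by-paper mean. The sketch below assumes each scoring is kept on its own sheet as a mapping from paper number to score; the function name and the scores are made up for illustration.

```python
def average_scorings(*scorings):
    """Average several independent scorings of the same papers.

    Each scoring maps a paper number to the score given on that occasion
    (or by that scorer); every scoring must cover the same papers.
    """
    papers = scorings[0].keys()
    return {
        paper: sum(scoring[paper] for scoring in scorings) / len(scorings)
        for paper in papers
    }


# Made-up scores on the five-point scale, recorded on separate sheets.
first_scoring = {1: 3, 2: 4, 3: 2}
second_scoring = {1: 4, 2: 4, 3: 3}

final_scores = average_scorings(first_scoring, second_scoring)
```

The same function accepts any number of scorings, so it also covers the case of several scorers.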

Even better than to have two scorings by the same teacher is to have
the papers scored by two teachers. This will go even further to remove
idiosyncrasies in scoring. The average scores given by the two teachers
will usually be much more reliable than either teacher’s scores
considered separately. Although it is difficult to find a fellow teacher who
is sufficiently familiar with the topic and who can spare the time, two
scorers are much better than one. For very important essay
examinations, such as those used to award scholarships or comprehensive
examinations in graduate schools, it is customary to use a number of
scorers, often as many as six. The average scores obtained from the
group of scorers are usually highly reliable.


Evaluation by Grading 


After the objective or essay test is scored, how are the results to be 
interpreted? Raw scores on objective tests have little meaning without 
comparison to standards. On a forty-item test, is a score of 25 good or 
bad? It might be either, depending on the standards being used. 
Scores on essay tests are usually more directly meaningful. If a numeri- 
cal scale is used such as the one discussed previously, the average 
score for a student over the test items has some intrinsic meaning. If a 
student has an average score of 4.5, this means that his answers were 
judged to be generally quite good. In contrast, an average score of 1.8 
would be considered rather poor. But it is still not possible to com- 
pletely interpret such scores until some standards are applied. Sup- 
pose, for example, that all the students made high averages, would the 
average of 4.5 still be considered as excellent? 




Before test results can be interpreted and communicated to others, 
they must be evaluated, that is, some statement must be made regard- 
ing what is good and what is poor performance. One important type 
of evaluation is called grading, which is one step removed from scor- 
ing. Grading consists in using labels, symbols, and verbal descriptions 
to indicate the quality of performance corresponding to particular 
scores. Forms of evaluation other than grading will be discussed in a
later section. Because grading constitutes an additional step, in the 
previous two sections a careful effort was made to speak of scoring 
rather than grading. 

Students need to learn how well they performed on tests. They will 
not be satisfied to learn only that they got thirty-four items correct on 
an objective test or that they achieved an average of 3.8 on an essay 
test. They want to know whether their scores are good or bad. Period- 
ically teachers must make over-all evaluations of students for (a) their 
own information, (b) informing students of their progress, (c) re- 
porting to parents, and (d) helping to make many decisions within 
the school. To make such evaluations, teachers must evaluate (or 
grade) individual tests, exercises, and assignments, and then combine 
these into an over-all evaluation. First a basis must be adopted for 
evaluating individual tests and exercises, and then a way must be 
found to combine them. The former is the far more difficult problem. 
Once an acceptable philosophy has been adopted for evaluating indi- 
vidual tasks, the problem of combining grades is not insurmountable. 
It must first be decided what is to be evaluated and then how it is to
be done. The “what” will be considered in this section; the “how” will be
discussed in the next section. Following are some of the various
approaches to what is evaluated.

Personal Preferences. Too often evaluation and grades are partly 
based on how well the teacher likes students. Of course, this is very 
poor practice, and every precaution should be taken to eliminate personal
biases in making evaluations. Carefully composed and wisely
used tests go a long way toward removing personal preferences for
students as a basis for evaluation.

Effort. It is tempting to evaluate students in terms of how hard 
they are trying. Of course, we are all concerned with getting students 
to do their best. If grades are based largely on effort, however, it will 
make it very difficult to communicate with others. Then parents, the
student, and teachers are not sure what a good grade means. Does it
mean that the student knows a great deal, or does it mean that he is
showing much more effort than progress?
Teachers should take notice of how hard the student is working and
try to communicate this information to others. But it is better not to
compound this with grading. Rather it would be better to make a
separate rating of “effort” and include this in over-all evaluations of
students.

Attitudes. It is often said that one of the major purposes of educa- 
tion is to create desirable attitudes, and part of our classroom effort 
is aimed at that purpose. Some of the “good” attitudes that we try to 
engender are (a) liking to read, (b) respect for scholars, and (c) 
flexibility of opinions. Should such attitudes be included in evalua- 
tions? If so, they should be rated separately and not be considered in 
formulating grades. If teachers attempt to rate attitudes, they should 
be aware of how difficult that is to do. Teachers frequently do not 
know students nearly as well as they think they do, and ratings of 
attitudes often are quite unreliable. Two teachers asked to make the 
same ratings might form very different impressions.

Better than to try to single out individual students with respect to 
their attitudes is to try to gear the over-all instruction toward creating 
desirable attitudes in all children. When the attitudes of students are 
poor, it is more often the case that teachers and schools deserve low 
grades for not doing a better job. Such attitudes vary greatly from 
class to class depending on the skills and personal qualities of the 
teachers. 

Individual Standards. Some teachers argue that it would be fairer 
to grade each student with respect to his capabilities. They rightly 
point out that a student who would receive a C grade with respect to 
brighter students may be doing as much as he is capable, and to con- 
tinually give him low grades will tend to discourage his efforts. This 
is a good point, one that will be discussed more fully in a later section; 
but it is quite confusing to try to base grades on individual standards. 
First, it is extremely difficult to decide what a student is capable of 
doing. Teachers’ impressions in this regard are not sufficiently valid. 
We might try to use intelligence tests for this purpose, but that has 
numerous pitfalls also, some of which are: (a) intelligence tests are 
not intended to be perfect measures of the ability to perform well in 
school, (b) much of the apparent difference between intelligence test 
performance and performance in school is due to sheer measurement 
error rather than to real differences in performance, and (c) to make 
such comparisons and to reach conclusions about the causes of differ- 
ences requires specialized training not had by most teachers. For these 
reasons it is best not to try to evaluate students and grade them on the 
basis of individual standards.

Use of Knowledge. One of the purposes of instruction is to get 
students to use what they learn in the classroom. Should usage be one 
of the standards for assigning grades? The best answer is “generally 
no.” There are two good reasons why. In most subject matters it is 
nearly impossible to measure how well students use their knowledge.
In some subjects this can be done to a limited extent. For example, it 
is possible to determine whether or not students practice good spell- 
ing and punctuation in all their schoolwork. However, for most other 
subject matters this would be nearly impossible. For example, how 
would one determine how well students are “using” geography, his- 
tory, or biology? 

Even if it were possible to measure usage of knowledge, it probably 
would be unfair to let this influence grades. Suppose that a student 
knows but chooses not to use. Is this not his own business, and would 
it not be rather repressive to punish him for exercising his own choice? 
A student may do exceedingly well in English composition; but in 
writing personal letters he might misspell, commit grammatical errors, 
and use inelegant expressions. Although teachers do try to influence 
students (hopefully for the better), in the final analysis, what students 
do with their knowledge is their own business. 

Achievement. By far the most sensible basis for grading is in terms 
of actual achievement. A grade should convey to others how compe- 
tent the student is with respect to the subject matter. By achievement 
is meant not only the memory for details but, in so far as they can be 
and are measured, the many aspects of “reasoning with.” If a student
receives a good grade in fifth-grade arithmetic, this should imply that
he can perform arithmetic operations well, that he understands arithmetic
concepts, and that he can solve a variety of word problems.
Specifically, the grade should mean that the student is ready for sixth-
grade arithmetic and will probably perform well there. Although in
many ways it may seem cold and impersonal, and it certainly leaves
much to be said about the student, the primary basis of grading should
be knowledge of subject matter. When this is the case, everyone concerned
will understand what the grade means.

Of course, intellectual goals are only some of the goals that are important
in education. Equally important are to broaden the horizons
of all students, to imbue all students with a love of learning, and to
help all students search for self-fulfillment. In spite of the importance
of these and other goals, they should not be mixed in with other considerations
in the assignment of grades. Somewhere, in some form,
students, parents, and others need to know how well students actually
are achieving in particular topics. This cannot be accurately communicated
if grades are complexly determined from aptitude, attitude,
effort, and actual achievement. Grades (as on report cards) should
reflect actual accomplishment. The other important goals of education
should also be reflected in evaluations. Some of these evaluations can
be included in report cards (e.g., ratings of effort and attitudes);
others can be communicated to students and parents during discussions.


How to Evaluate 


After deciding (at least the author has) to use actual achievement 
as the primary basis for assigning grades, we need to make some fur- 
ther decisions about how that is to be done. Following are the methods 
that are most commonly used. 

Arbitrary Percentages. A time-honored, but questionable, standard
is to say that 90 is A, 80 is B, 70 is C, etc. One might rightly inquire
“90 what?” If such standards are literally applied to objective tests, 
they result in very poor testing practices. In that case, it would be 
necessary to include many very easy items in the test to ensure that 
the average student got well over 70 per cent correct. In the previous 
chapter it was said that it is very poor practice to include either many 
very easy or many very difficult items in a test. 

On essay tests, the 90, 80, etc., standard is merely a device for com- 
municating the teacher's impressions of the quality of the work. In 
this sense it is much like the numerical scale discussed previously. 
However, the standards are usually not made explicit, and both 
teachers and students are sometimes confused in regard to what is 
meant by 78. 

On objective tests the arbitrary 90, 80, etc., standard is completely 
inappropriate and should not be used. On essay tests the arbitrary
standard is more ambiguous and less directly useful than the type of
numerical scale discussed previously.

Rank in Class. One way to grade students and report progress is
simply to state how well each did with respect to the others in the 
class. This can be done with ranks, e.g., third from the top out of 
twenty-four students, or, better, with percentiles and/or standard 
scores. When this approach is followed, students actually are not 
graded at all. Grades imply values, statements of what is “good” and 
what is “bad.” The 90th percentile does not necessarily mean “good,”
but without further information, most of us would take it to mean that. 

Rank Grades. One way to assign grades is automatically to give
percentages of A’s, B’s, etc., corresponding to the ranks or percentiles
of students. One of the most popular of such schemes is as follows: 


Grade     Per cent of students
A         top 10
B         20
C         40
D         20
F         bottom 10


When this type of grading is used, the exact percentages vary from
teacher to teacher, from level to level within a school, and from school
to school. This method is referred to by students as grading on the
curve, which means that grades are automatically determined by posi- 
tions of students with respect to one another. 
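
As a rough sketch of how such a scheme operates, the short Python program below assigns letter grades purely from rank order, using the 10-20-40-20-10 percentages shown above. The function name `rank_grades`, the sample scores, and the rule that ties are broken by list order are assumptions made only for illustration.

```python
# Illustrative sketch of "grading on the curve": grades depend only on rank.
# The quota table mirrors the 10-20-40-20-10 scheme in the text; the names,
# sample scores, and tie-breaking rule are assumptions for this example.

CURVE = [("A", 10), ("B", 20), ("C", 40), ("D", 20), ("F", 10)]  # per cent


def rank_grades(scores, curve=CURVE):
    """Return a list of letter grades, one per score, in the input order."""
    n = len(scores)
    # Indices of students ordered from highest to lowest score.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    grades = [None] * n
    cumulative = 0
    start = 0
    for letter, pct in curve:
        cumulative += pct
        # Number of students covered once this grade band is filled.
        end = round(n * cumulative / 100)
        for i in order[start:end]:
            grades[i] = letter
        start = end
    return grades


if __name__ == "__main__":
    class_scores = [38, 35, 33, 31, 30, 29, 28, 27, 25, 12]
    print(rank_grades(class_scores))
    # Lowering every score by ten points leaves every grade unchanged.
    print(rank_grades([s - 10 for s in class_scores]))
```

Because only the ordering of scores enters the calculation, the two calls produce identical grades even though the second class answered ten fewer items correctly throughout.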

Is this a good method of grading? In its strictest form, no. Before 
we criticize this method of grading, let us look more carefully at what 
was meant in the last sentence by “in its strictest form.” If a teacher 
adopts a uniform curve which he applies year after year, then the 
method has serious faults. 

The worst fault of the method is that it does not, and cannot, take 
account of the over-all level of performance in the class. The class as 
a whole may be very dull or very bright, or may be doing very well or 
very poorly, but still (if the curve is slavishly followed) the same 
number of A’s and F’s will be given. A student who might have done
well in one class does poorly because he is in another. This criticism 
is much more applicable to high school and college courses, where the 
levels of aptitude and effort often vary markedly from section to sec- 
tion of courses and from term to term. However, even in the ele- 
mentary grades some inequity is encountered. A student who goes to 
school in an upper-middle-class area has much stiffer competition than 
the student who goes to school in a working-class neighborhood. 
Sometimes the competition is so keen that a child who would be con- 
sidered an A student in most schools consistently gets C’s in his rarefied 
school environment. 

Another fault of slavishly grading on the curve is that it tends to 
create bad attitudes among students. The student soon realizes that he 
is in direct competition with his classmates. Consequently, he hopes 
that his friends do as poorly as possible so that his efforts will look 
good in comparison. In this environment, “curve busters” (students 
who consistently make high scores on tests) usually are envied but 
are not always well liked. 

Strictly grading on the curve makes a farce of efforts to help and encourage
particular students because, if the scheme is strictly adhered
to, some 10 per cent, or some other fixed percentage, must fail. All the 
teacher can do by helping and encouraging one student is to boost 
him out of the lower 10 per cent, which automatically drops another 
student into the failing zone. 

Strict adherence to the curve takes no account of differences among 
teachers. Some teachers encourage and excite their students to work
and learn far beyond what they would have done with another teacher. 
Then is it still fair to adhere strictly to a set percentage of grades in 
each category? 

Another fault of rank grades is that teachers use different
curves. Some typically give about 20 per cent of the students A’s, and
others give only 5. Some teachers seldom give F’s, and others typically
give more than 15 per cent. If all teachers used the same curve, at
least the resulting grades would have uniform meaning, in which case 
they would serve only as a rough shorthand for the student’s rank in 
class. Because teachers use such different curves, it makes it difficult 
to interpret particular grades. 

Absolute Standards. Most of us would like to think that we have 
absolute standards, that we can, and do, judge the knowledge of stu- 
dents without recourse to any other standards. We can be heard saying 
“That is an A paper,” and “That paper deserves an F.” Although the 
use of absolute standards is, in many ways, the ideal way to evaluate 
students, let us look at some of the difficulties which currently are 
encountered in trying to put that into practice. 

Even those teachers who say that they use absolute standards usu- 
ally pay heed to the curve. Suppose that a teacher follows his absolute 
standards in grading either an objective or an essay test, and after the 
grading is completed, he finds that, by his grading, all the students 
would fail. Would he really give F’s to all his students? Even though
there are some hardy (or foolhardy) exceptions, most teachers would
change their standards before posting grades. At all levels of instruction,
including that in graduate schools, it is hard to avoid some pressure
to give the “usual” numbers of A’s, B’s, and C’s. When teachers
depart markedly and consistently from the “usual,” there often are 
subtle, and sometimes not so subtle, pressures to bring grading in line 
with that of others in the school setting. Perhaps this pressure is not 
good, and perhaps teachers should not succumb to it, but it is there 
nevertheless. 

Another problem in developing absolute standards is that of formu- 
lating reasonable bases for the standards. Older instructors often act 
as though years of teaching alone is sufficient for the mysterious ac- 
cumulation of wise and just standards. If you question teachers about 
their “absolute standards,” you usually get rather fuzzy, and often 
rather defensive, replies. 

Absolute standards are easier to set at some levels of training and 
in some types of courses than at other levels of training and in other 
types of courses. In general, the higher the grade level, the easier it is 
to apply absolute standards. To a large extent, colleges and univer- 
sities can set their own standards and adhere to them rather strictly. 
Either students meet the standards or they go elsewhere. Even further 
along, in graduate and professional schools, professors have rather 
firm standards about what constitutes adequate performance, and they 
can enforce their standards with little regard for how well the group 
as a whole performs. 

Even at lower educational levels, in some types of courses it is possible
to set rather absolute standards. This would be true in some of
the vocational courses in high school. For example, the teacher knows 
that if the student cannot correctly type thirty words per minute, she 
is not going to make out as a typist. Consequently, the teacher can 
allot grades with little regard for how many A’s, B’s, and C’s are given. 
The same is true of other vocational courses. However, for most 
courses in high school, and nearly all those in elementary school, it is 
very difficult to establish absolute standards. 

The discussion in this section is not meant to deride the idea of ab- 
solute standards, because, in fact, the idea is quite appealing. What 
is meant to be conveyed is that (a) even those teachers who say they 
use absolute standards usually employ other criteria as well, (b) in 
school settings it is very difficult to stick to one’s own absolute stand- 
ards, and (c) it is often quite difficult to provide a reasonable basis for 
the absolute standards which are used. 

Compromise Procedures. Admittedly the use of grades as a form 
of evaluation presently poses many knotty problems, and there are
legitimate differences of opinion about what should be done. One solu- 
tion is to not use grades at all and, instead, to rely on other forms of 
evaluation, some of which will be discussed in a later section. 

The question of whether or not to use grades, and if so, how, de- 
pends on a number of considerations. One relevant consideration is 
the use that will be made of evaluations. For example, teachers may 
want to use letter grades for their own information but not report 
these to students and parents. 

Another consideration is the level and type of course. As was men- 
tioned previously, absolute standards can be applied more easily at 
upper levels of education, and particularly in certain types of courses.
In other types of courses and at lower levels of instruction, rank in
class usually must be given strong consideration.

At most levels, and in most cases, when grades are determined and 
reported to students and parents, they usually are jointly compounded 
of rank in class and absolute standards. Most teachers try to use absolute
standards only, but they temper this somewhat by how well the
class as a whole performs.


Reporting Grades


As was mentioned in the previous section, the use of grades depends
in part on the person(s) for whom the information is intended. Following
are some of the most prominent uses of grades.

Teachers. One of the most prominent uses of tests is to supply 
teachers with information as to how well students are doing. Without
this information they will have difficulty in managing the instruction
and in gearing it to the particular needs of students. If tests served no
other purpose than to help teachers make day-to-day decisions, they
would be well worth the effort. 

When evaluation is used to supply information to teachers, many of 
the complex problems which were mentioned previously do not oc- 
cur. Mainly, the teacher needs to know rank in class in order to give 
special help to those who need it most. In addition, the teacher needs 
to form at least an approximate notion of how well the class as a 
whole is progressing. This can be obtained either by forming an im- 
pression of the over-all test results or (better) by trying to adhere to 
absolute standards in allotting grades. The latter would mean taking 
seriously the average grade given on essay questions and the average 
number of items correctly answered on objective tests. This is usually 
sufficient to give the teacher a good impression as to how well the 
class as a whole is performing. 

Students. The problem is more difficult when considering how to 
report evaluations to students. Because reporting letter grades to stu- 
dents does often discourage those who are less capable, and because 
it often motivates students to strive for good grades as an end in itself, 
some have proposed that grades should not be reported to students. 
One proposed alternative is to let students evaluate their own progress. 
Although this appeals to our democratic urges, it simply does not 
work. Students are often grossly ignorant of how well they are doing
until, by one means or another, the teacher evaluates their work. Even 
college students often are surprised when they learn how well or how 
poorly they are doing in particular courses. 

Students need to learn how well they are doing, and by some method 
or another, they have to be told by the teacher. In doing this, rank in 
class alone is probably not the most useful thing to report. Similarly, 
grades based directly on rank in class (grading on the curve) are lit- 
tle better than the ranks themselves. Probably the best compromise is 
(as was advocated previously) to use letter grades based partly on 
rank in class and partly on absolute standards. This will allow the 
flexibility that is often needed to (a) show the students as a group 
that they are doing relatively well or relatively poorly, (b) show the 
class that they are performing very differently, or very much the same, 
as one another, and (c) indicate to individual students how well the 
teacher thinks they are progressing regardless of their rank in class. 
Admittedly, using letter grades in this way to report progress to stu- 
dents has many drawbacks, but at the present time, no better com- 
promise solution is available. 


The skilled teacher has many opportunities to supplement letter 
grades with less “cold” types of communication. He can let less capable 
students know that he likes them even if they never do A work; he can 
(when it is true) compliment them on their nonintellectual achievements;
and he can encourage them to work for their own self-enhancement.
Similarly, the teacher can let bright students know that even
though they stand near the top of the class, they can do much better.
In these ways, he can try to bring out the best in every student and
make learning fun, regardless of grades.

Parents. Probably the most difficult problem arises when reporting 
evaluations to parents. It is often the case that students are more will- 
ing to accept their victories and defeats than are the parents. It is very 
hard for many parents to accept the fact that C work is the best that 
their child can do, and other parents are disturbed if their child occa- 
sionally drops from A to B. Some have suggested that parents them- 
selves should evaluate the progress of their children without other 
forms of evaluation being used. If students are often poor judges of
their own progress, parents are notoriously poor judges of how well 
their children are performing. If this were the only form of evaluation 
had by parents, they usually would be grossly ignorant of how well 
their children were doing in school, and they would be in a poor posi- 
tion to help their children or help plan for their future schooling. 

Although the old-fashioned report card has many faults, no one has 
yet found a way to do away with it. Some of the bad attitudes which 
report cards can engender are widely known and have been men- 
tioned in previous sections. Perhaps rather than do away with report 
cards, it would be well to modify them in such a way as to (a) reduce 
some of the bad effects and (b) retain the necessary communication 
of how well students are progressing in school. 

One way to convey evaluations is in relation to grade level of work. 
A much used procedure is to report “working at grade level,” “working
below grade level,” or “working above grade level.” The three-category
scale is applied separately to different topics. This method of
reporting progress has some advantages over the traditional A, B, C,
etc., system, particularly in the elementary grades. It is a less personal
system and more accurately reports the standing of the student. At the 
elementary levels, grade-level progress is easily understood by parents 
and others. However, in high school it is probably wise to use the A, 
B, C system instead. Grade level has little meaning in high school. For 
example, in a course like physics, which is offered only in high school, 
and for which there are no preceding courses, grade-level progress 
makes little sense. Also, by the time the student reaches high school, 
he has formulated a reasonably good idea of his capabilities and is 
not as “hurt” by poor grades. By this stage of training, finer distinctions 
are more necessary in evaluation, e.g., to help students and parents
decide whether or not application for college training should be made.

In using report cards, it is good to include a number of types of information
in addition to grades. Ratings can be made of effort, social
behavior, work habits, attitudes, and others. Although, as has been
mentioned a number of times, such ratings have their limitations, they 
do provide a means of communicating about students who are quite 
extreme in some respect, e.g., the child who is highly withdrawn and 
shy. 

A number of practices have been recommended to supplement the 
use of report cards. Unfortunately, all these add many more hours of 
work to already overworked teachers. One of these is for teachers to 
write letters discussing the progress of individual students. This is 
sometimes advisable, and feasible, for a few outstanding students, out- 
standing either because they are having a great deal of trouble or be- 
cause they are making excellent progress. However, if teachers had to 
write letters periodically to all the parents, it would be a very time- 
consuming activity. Also, unless such letters are carefully composed, 
parents will read between the lines things that were not intended, and 
it would be better if they had received no communication.

One good way to supplement report cards is with parent-teacher 
visits, preferably in the school. If the teacher is skillful at communicating
with parents, he can create a great deal of understanding and
good will in a short discussion. 


Combining Scores 


The final grade in a unit of instruction is usually derived from a 
number of tests, out-of-class reports, and exercises. Consequently, it is 
necessary to develop sensible methods for effectively combining the 
different sets of scores. Elaborate statistical methods (some of them
straight out of Rube Goldberg) have been recommended for combin- 
ing scores. These will not be described because (a) the reader would 
drown in the statistical elaborations, (b) practically no teachers (in- 
cluding the author) would ever go to the trouble to use them, and (c) 
they are no more logical than some simpler methods. Instead, some 
easily applied methods of combining scores which are quite adequate 
will be described. 

Combining Objective Tests. If the final grade is based entirely on 
the results from several objective tests, by far the most reasonable 
practice is to add all the raw scores from the several tests. If, for ex- 
ample, there are two tests with forty items each and a final examina- 
tion with eighty items, the maximum possible total score is 160. A stu- 
dent with scores of 25, 30, and 65, respectively, on the three tests 
would receive an over-all score of 120. Because the tests differ in 
length, it might be questioned why the different tests should not be 
weighted accordingly. The answer is that, because scores will tend to 
disperse themselves more on the longer tests, simply adding all scores 
automatically gives each test an approximately correct weight. 
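
The arithmetic in this paragraph can be sketched in a few lines of Python. The test lengths and the student's scores are those of the example above; the function name `summed_score` and the input checks are hypothetical additions.

```python
# Minimal sketch: combine objective tests by adding raw scores.
# Test lengths (40, 40, 80) and the scores (25, 30, 65) are from the text;
# the function name and validation are assumptions for illustration.

TEST_LENGTHS = [40, 40, 80]  # two forty-item tests and an eighty-item final


def summed_score(raw_scores, test_lengths=TEST_LENGTHS):
    """Return (total items correct, maximum possible total)."""
    if len(raw_scores) != len(test_lengths):
        raise ValueError("one raw score is needed for each test")
    for score, length in zip(raw_scores, test_lengths):
        if not 0 <= score <= length:
            raise ValueError("a raw score must lie between 0 and the test length")
    return sum(raw_scores), sum(test_lengths)


if __name__ == "__main__":
    total, maximum = summed_score([25, 30, 65])
    print(total, "out of", maximum)  # 120 out of 160
```

No explicit weights appear anywhere in the sketch. The eighty-item final supplies half of the possible points, which is the sense in which adding raw scores weights the tests without further computation.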




If the final grade consists only of rank in class, this can be easily ob- 
tained from the summed scores. Students can be ranked from top to 
bottom with respect to summed scores, and if it is desired, the ranks 
can be converted to percentiles. If in addition to rank in class, it is de- 
sired to compute rank grades, the curve can be determined directly 
from the ranks by giving the top specified percentage A, the next 
specified percentage B, and so on. 
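
A brief Python sketch of the ranking step follows. The text does not say how ties are to be handled or exactly how a percentile is to be computed, so two conventions are assumed here: tied students share the better rank, and a percentile is the per cent of the class with a lower summed score. The sample totals are invented.

```python
# Sketch: rank in class and percentiles from summed scores.
# Assumptions (not from the text): ties share the better rank, and the
# percentile is the per cent of students with a strictly lower total.


def ranks_and_percentiles(summed_scores):
    """Return a list of (rank, percentile) pairs in the input order."""
    n = len(summed_scores)
    results = []
    for score in summed_scores:
        higher = sum(1 for other in summed_scores if other > score)
        lower = sum(1 for other in summed_scores if other < score)
        rank = higher + 1             # 1 = top of the class
        percentile = 100 * lower / n  # per cent of the class scoring lower
        results.append((rank, percentile))
    return results


if __name__ == "__main__":
    totals = [120, 95, 140, 110, 95]  # hypothetical summed scores
    for total, (rank, pct) in zip(totals, ranks_and_percentiles(totals)):
        print(total, "rank", rank, "percentile", pct)
```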

If final grades are meant to reflect absolute standards as well as 
rank in class, it is necessary to define the scores that correspond to 
each grade level. That is, on a forty-item test, the teacher must say 
how high a score will be required for an A, a B, etc. When the teacher
first tries to formulate such standards, it will seem like sheer guess- 
work. After some experience with objective tests, teachers become 
more confident about their standards. Some guidelines can be given to 
the new teacher. 

At one extreme it is unreasonable to expect even very bright stu- 
dents to get all, or even nearly all, of the items correct. Even the best 
teacher-made test has at least several relatively ambiguous items which 
would lead even experts to give “wrong” answers. Even without this
consideration, it is not expected that any student will completely 
master a subject matter. Consequently, the A level is usually set at 85 
or 90 per cent of the items correct. At the other extreme, students often 
should be given a failing grade even when they get some of the items 
correct. Even the most inept, lazy student will usually do better than
could be obtained from sheer guessing alone. Consequently, the mini- 
mum level for a passing grade is usually set at about 40 or 45 per cent 
of the items correct. (In this connection, remember that, purely by
guessing, students could get some of the items correct, the number
being inversely proportional to the number of alternative answers for
each item.) In between these two extremes it is hard to be precise in
designating grade levels, and probably not worth the trouble to be
highly precise. What is usually done is to divide up the number of
items between the failing zone and the A zone into three approximately
equal parts and designate these as the B, C, and D zones,
respectively. One such scheme for allocating grades in terms of ab- 
solute standards (illustrated with a forty-item test) is as follows: 


Number of items     Per cent
correct             correct      Grade
34 or more          85           A
29-33               72           B
24-28               60           C
19-23               48           D
18 or less          45           F


The scheme above is purely illustrative and is not intended as a 
standard to be used in general. However, it is interesting to note that, 
if comparisons are made among the standards used by different teach- 
ers, they tend not to vary greatly from the scheme shown above. 
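
The procedure just described (fix the A level, fix the failing level, and divide what lies between into three roughly equal zones) can be sketched in Python as follows. The defaults of 85 and 45 per cent are the illustrative figures from the scheme above; the function names, the rule for distributing leftover scores, and the four-alternative example for guessing are assumptions.

```python
# Sketch of the illustrative absolute-standards scheme for an n-item test.
# Defaults (A at 85 per cent, failing at 45 per cent or below) follow the
# example in the text; everything else is an assumption for illustration.


def grade_bands(n_items, a_pct=85, fail_pct=45):
    """Return {grade: (lowest score, highest score)} for an n-item test."""
    a_min = -(-a_pct * n_items // 100)   # ceiling of a_pct per cent of items
    f_max = fail_pct * n_items // 100    # floor of fail_pct per cent of items
    middle = a_min - f_max - 1           # scores strictly between the two zones
    if middle < 3:
        raise ValueError("too few items to form B, C, and D zones")
    width, extra = divmod(middle, 3)
    bands = {"F": (0, f_max)}
    low = f_max + 1
    # Fill D, C, B from the bottom; leftover scores go to the lower zones.
    for k, letter in enumerate(["D", "C", "B"]):
        size = width + (1 if k < extra else 0)
        bands[letter] = (low, low + size - 1)
        low += size
    bands["A"] = (a_min, n_items)
    return bands


def chance_score(n_items, n_alternatives):
    """Expected number correct from blind guessing on every item."""
    return n_items / n_alternatives


if __name__ == "__main__":
    for letter, (low, high) in grade_bands(40).items():
        print(letter, low, "to", high)
    # With four alternatives per item (an assumed format), guessing alone
    # is expected to yield 10 of 40 items, well below the failing line.
    print(chance_score(40, 4))
```

For a forty-item test the sketch reproduces the zones in the table: 34 or more for A, 29-33 for B, 24-28 for C, 19-23 for D, and 18 or less for F.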

In using a grading scheme such as the one shown above, students 
should be informed that the scheme strictly applies only to the final 
summed scores, after the several tests are combined. That is, after 
scores from the several tests are summed, the teacher gives A’s to stu- 
dents who get 85 per cent of the items correct on all tests combined, 
and so on for the other grade levels. Basing grades on the percentage
of items correct on final summed scores will save the teacher from 
getting into many knotty problems about how to combine the letter 
grades from the separate tests. 
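
A minimal Python sketch of this rule is given below. The percentage cutoffs are the illustrative ones from the scheme above, with anything under 48 per cent treated as failing (an assumption, since the scheme leaves a small gap between 45 and 48). The function name and the second and third sets of scores are hypothetical.

```python
# Sketch: assign the final letter grade from the percentage of items correct
# on all tests combined, never from the letter grades of separate tests.
# Cutoffs follow the illustrative scheme in the text; treating everything
# below 48 per cent as F is an assumption.

CUTOFFS = [(85, "A"), (72, "B"), (60, "C"), (48, "D")]  # minimum per cent


def final_grade(raw_scores, test_lengths, cutoffs=CUTOFFS):
    """Return the letter grade for the summed raw scores."""
    per_cent = 100 * sum(raw_scores) / sum(test_lengths)
    for minimum, letter in cutoffs:
        if per_cent >= minimum:
            return letter
    return "F"


if __name__ == "__main__":
    lengths = [40, 40, 80]
    print(final_grade([25, 30, 65], lengths))  # 120 of 160 is 75 per cent: B
    # An A-level first test (36 of 40) does not by itself produce a final A.
    print(final_grade([36, 30, 65], lengths))  # 131 of 160 is about 82 per cent: B
```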

In using grade standards such as the one illustrated above, teachers 
should be aware of the effect of regression toward the mean. In look- 
ing at the results of the first test, teachers and students are often 
fooled into believing that approximately the same percentage of stu- 
dents will get A’s at the end of the term as appeared on the test, that 
the same percentage will fail, and that approximately the same per- 
centages will be in the other grade categories. This definitely is not the 
case. The reason that it is not the case is that the group which gets 
A’s on the second, and other tests, will not be quite the same as the 
group which made A’s on the first test. Also, the students who fail the 
second test will, as a group, be somewhat different from the group 
who failed the first test. Some sample results will help illustrate the 
effect of regression toward the mean. In this case imagine that three
tests are given, that a fixed per cent of students make grades in each 
letter category on each test, and that the three tests are summed to 
determine final grades. 


Per cent of items     Per cent of students     Per cent of students
correct               on each test             on summed tests
85-100 (A)            10                       4
72-84 (B)             20                       18
60-71 (C)             40                       56
48-59 (D)             20                       18
0-45 (F)              10                       4


As the results above show, even though 10 per cent of the students 
reach the A level on each of the separate tests, only 4 per cent con- 
sistently score high enough to reach the A level on the summed scores. 
At the other extreme, 10 per cent fail on each separate test, but only 
4 per cent consistently score so low that they receive a final grade of F. 

The lesson that should be learned from the illustration above is that
in order to have a proper balance of final grades, it is necessary to be 
somewhat more extreme in the grading of individual tests. On individual
tests, teachers should be relatively lenient in giving B’s and A’s and
relatively severe in giving D’s and F’s. Many of the students who receive
extreme grades, in either direction, on the separate tests will end
up with averages nearer the middle of the grade range. In trying to 
establish absolute standards (and this discussion illustrates some of 
the difficulties in so doing), the teacher should think in terms of what 
grades should be given for consistent performance at particular levels. 
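
The shrinkage at the extremes described above can be reproduced with a small simulation. The sketch below is only an illustration under assumed conditions, not the data behind the table: each simulated student has a stable ability plus an independent error on each of three tests, the A cutoff is set so that 10 per cent reach it on the first test, and the same cutoff is then applied to each student's average. The function names, the normal distributions, and the equal spread of ability and error are all hypothetical choices.

```python
import random


def share_reaching_cutoff(scores, cutoff):
    """Return the fraction of scores at or above the cutoff."""
    return sum(score >= cutoff for score in scores) / len(scores)


def simulate(n_students=20000, n_tests=3, seed=0):
    """Compare the share of A's on single tests with the share on averaged tests."""
    rng = random.Random(seed)
    # Assumed model: test score = stable ability + independent test-day error.
    abilities = [rng.gauss(0, 1) for _ in range(n_students)]
    tests = [[ability + rng.gauss(0, 1) for ability in abilities]
             for _ in range(n_tests)]
    # Fix the A cutoff so that 10 per cent of students reach it on the first test.
    cutoff = sorted(tests[0])[int(0.90 * n_students)]
    per_test = [share_reaching_cutoff(test, cutoff) for test in tests]
    # Apply the same cutoff to each student's average over all tests.
    averages = [sum(student) / n_tests for student in zip(*tests)]
    combined = share_reaching_cutoff(averages, cutoff)
    return per_test, combined


if __name__ == "__main__":
    per_test, combined = simulate()
    print("share of A's on each test:", per_test)
    print("share of A's on averaged tests:", combined)
```

Under these assumptions the share of A's on the averaged scores comes out smaller than the share on any single test, because few students stay above the cutoff on every occasion. The size of the drop depends on how much of the score is error.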

Combining Essay Tests. If the final grade is based solely on essay 
tests, it is not very difficult to combine the results from separate tests. 
Whereas it was recommended to simply average the results of objec- 
tive tests, the results of essay tests should usually be weighted before 
averaging. The most sensible way to weight is in proportion to the 
amount of page space allotted for answers on each test. If four tests 
are given during the term, three of these requiring three pages of 
answers and a final exam requiring six pages of answers, the final test 
should be counted twice as much as the others. The tests are averaged 
in exactly the same way as one averages the numerical scores given to 
the separate questions within one test. In this case the numerical scores 
for the 3 three-page tests would be summed, and to this would be 
added twice the numerical score obtained on the six-page test. The 
total should then be divided by five in order to place the final scores 
on the same scale used to grade each of the separate tests. 
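
The weighting rule just described amounts to a weighted average with the page allotments as weights. The sketch below illustrates that arithmetic; the function name and the sample scores are hypothetical. Weighting by 3, 3, 3, and 6 pages and dividing by 15 gives the same result as the weights 1, 1, 1, and 2 divided by five used in the text.

```python
def weighted_essay_average(scores, pages):
    """Average essay-test scores, weighting each test by its answer-page allotment."""
    if not scores or len(scores) != len(pages):
        raise ValueError("scores and pages must be non-empty and of equal length")
    weighted_sum = sum(score * page for score, page in zip(scores, pages))
    return weighted_sum / sum(pages)


if __name__ == "__main__":
    # Three three-page tests and a six-page final (illustrative scores only).
    scores = [4.0, 3.0, 3.5, 4.5]
    pages = [3, 3, 3, 6]
    print(weighted_essay_average(scores, pages))
```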

Regression toward the mean occurs in combining the numerical 
scores from essay tests in exactly the same way as it does on objective 
tests. Consequently, teachers should grade more toward the extremes
on separate tests than they would if only one test were given all term. 

Even if separate questions are graded on a scale with 5 correspond- 
ing to A, 4 to B, 3 to C, 2 to D, and 1 to F, this still does not tell the 
teacher exactly how to allot grades on the basis of summed test results.
It is not reasonable to expect the A student to get all 5s, nor
would the F student be exempted from failing merely because he got
scores higher than 1 on several questions. One approximate scheme for
translating average numerical scores on separate tests into letter grades
is as follows:


Letter grade     Score average

A                4.5 or more
B                4.0-4.4
C                3.0-3.9
D                2.0-2.9
F                1.9 or less


The grading scheme can be applied both to the separate tests and to
the final weighted average.
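
The translation scheme can be written as a short lookup. This is a minimal sketch: the function name is hypothetical, and the treatment of averages that fall between the printed bands (for example 4.45 or 1.95) is an assumption, since the scheme is stated only to one decimal place. Here each band is taken to begin at its lower bound.

```python
def letter_grade(score_average):
    """Translate an average score on the five-point scale into a letter grade."""
    # Assumption: a band starts at its printed lower bound, so 4.45 counts as B
    # and 1.95 counts as F.
    if score_average >= 4.5:
        return "A"
    if score_average >= 4.0:
        return "B"
    if score_average >= 3.0:
        return "C"
    if score_average >= 2.0:
        return "D"
    return "F"


if __name__ == "__main__":
    for average in (4.8, 4.2, 3.4, 2.5, 1.7):
        print(average, letter_grade(average))
```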




Combining Results from Different Types of Tests. More difficult 
problems are encountered when trying to combine the results of essay 
and objective tests, and combining both of these with reports and 
themes. Probably the most feasible approach (within the limits of 
time and energy that can be expected of teachers) is to try to reduce 
all the different types of scores to the type of numerical scale recommended
for grading essay questions (5 for A, 4 for B, etc.). The grading
scale can be directly applied to lab reports, themes, term papers, and 
other such materials that are used in addition to tests for determining 
final grades. 

It is somewhat more difficult to generate a thoroughly logical 
scheme for transforming results of objective tests to the five-point scale 
in order to facilitate combining with other indices. A reasonable ap- 
proximation is to transform objective test results by giving the student 
a score on the five-point scale approximately corresponding to the let- 
ter-grade equivalent made on the objective test. Using this method to 
compare the scheme shown previously for grading objective tests and 
that shown for grading essay tests, we see that the following trans- 
formations would be made: 


Per cent of items        Corresponding score       Letter-grade
correct on objective     on five-point scale       equivalent
test

85                       4.5                       A
72                       4.2                       B
60                       3.0                       C
48                       2.0                       D
45                       1.9                       F


If more than one objective test is used during the term, it would be 
best to make the transformation on summed scores rather than on 
scores obtained from the separate tests. Then, if a student got 85 per 
cent of all the objective items correct, this would be transformed to 
an equivalent of 4.5 on the five-point scale. A student who got 48 per 
cent of the objective items correct would receive an equivalent score 
of 2.0. Unfortunately, this method of transforming results of objective 
tests runs into a snag when, as will surely happen, students get “odd”
percentages of the items correct, e.g., 87 per cent and 64 per cent. In
these cases the only thing that can be done is to make an interpolation
of the objective test results to the nearest equivalent on the five-point
scale. Teachers should either make the best interpolation “by eye” or
do some simple arithmetic. A method for doing this is described in
Appendix C-9.
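
One simple piece of arithmetic that serves this purpose is straight-line interpolation between the entries of the transformation table. The sketch below is offered only as an illustration and is not necessarily the method of Appendix C-9. The anchor points from 45 to 85 per cent are copied from the table as printed; the two end points (0 per cent mapped to 1.0 and 100 per cent mapped to 5.0) are assumptions added so that every percentage has an equivalent. The names are hypothetical.

```python
# (per cent of items correct, score on five-point scale)
ANCHORS = [
    (0.0, 1.0),    # assumed lower end of the scale, not in the table
    (45.0, 1.9),
    (48.0, 2.0),
    (60.0, 3.0),
    (72.0, 4.2),
    (85.0, 4.5),
    (100.0, 5.0),  # assumed upper end of the scale, not in the table
]


def to_five_point(per_cent_correct):
    """Interpolate linearly between neighboring anchor points."""
    if not 0 <= per_cent_correct <= 100:
        raise ValueError("per cent correct must lie between 0 and 100")
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if x0 <= per_cent_correct <= x1:
            return y0 + (per_cent_correct - x0) * (y1 - y0) / (x1 - x0)


if __name__ == "__main__":
    for per_cent in (87, 64):
        print(per_cent, round(to_five_point(per_cent), 1))
```

Rounding the result to one decimal place keeps it on the same footing as the scores assigned to essays and reports.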




After all types of tests and exercises have been transformed to the 
five-point scale (or some other type of scale which the teacher pre- 
fers), it is much easier to combine results and arrive at a final grade. 
In deciding on the weights to be applied to the different sources of 
evaluation, teachers have to rely mainly on their own judgment. One 
clue is the amount of class time used for each type of test. For ex- 
ample, if two hours during the term were used for objective tests and 
four hours were used for essay tests, this might suggest that the essay 
tests should be counted twice as much. In weighting laboratory reports, 
themes, and other exercises, teachers have to rely almost solely on their 
judgment regarding the relative weights that should be given. A 
typical set of results is as follows: 


                      Average scaled scores

Objective tests   Essay tests   Reports   Final average   Grade

3.4               3.0           3.8       3.4             C
4.6               4.3           4.0       4.3             B
2.1               1.6           1.4       1.7             F


In the example above, the three sources of evaluation were weighted 
equally to simplify the illustration. The first student receives a total 
average of 3.4, which is in the C zone; the second student receives an 
average of 4.3, which is in the B zone; and the third student receives
an average of 1.7, which is in the failing zone.
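
The final averages in the illustration can be checked with a few lines of arithmetic. The sketch below assumes that every source of evaluation has already been placed on the five-point scale. The function and the dictionary keys are hypothetical names, and the unequal weights in the last call are an invented example of counting essay tests twice.

```python
def final_average(scaled_scores, weights=None):
    """Weighted average of scaled scores; equal weights are used if none are given."""
    if weights is None:
        weights = {source: 1.0 for source in scaled_scores}
    total_weight = sum(weights[source] for source in scaled_scores)
    weighted_sum = sum(score * weights[source]
                       for source, score in scaled_scores.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    students = [
        {"objective": 3.4, "essay": 3.0, "reports": 3.8},
        {"objective": 4.6, "essay": 4.3, "reports": 4.0},
        {"objective": 2.1, "essay": 1.6, "reports": 1.4},
    ]
    for scores in students:
        print(round(final_average(scores), 1))
    # Essay tests counted twice as much as the other sources (illustrative weights).
    double_essay = {"objective": 1.0, "essay": 2.0, "reports": 1.0}
    print(round(final_average(students[0], double_essay), 1))
```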

If the purpose of combining measures is to obtain rank in class, this
can be obtained from the final averages. If rank grades are given, the
predetermined percentages can be obtained from the ranks, and these
can be converted to letter grades. If teachers want to give literal
interpretations of their absolute standards, they can assign letter grades
corresponding to the five-point scale, as was done in the example
above. But, because of the difficulties in formulating absolute standards
and the difficulties of implementing these in the form of test
scores, most teachers temper the absolute results with some consideration
for how well the group as a whole performed. If, after seeing that
students as a group do poorly and after thinking about it, the
teacher decides that he has been too rough in his grading, he will
want to be somewhat more lenient. Conversely, he might decide to be
a little tougher before assigning final grades. When teachers take such
considerations into account (and nearly all do), the standards are
considered as zones of uncertainty rather than as exact points. If a
student receives a final average of 4.8, he obviously has performed so well
that he deserves an A. But a student whose final average is 4.5 might
be given an A or a B depending on how well the group as a whole 
performs. Similarly, a student with a final average of 1.8 might receive 
either an F or a D depending on the performance of the total group. 

What usually happens when trying to implement absolute standards 
is that the standards help greatly to reduce the uncertainty regarding 
what grades should be given. But they do not entirely remove the un- 
certainty, because teachers inevitably temper their standards by (a) 
how well the group as a whole performs, (b) afterthoughts about how 
easy or difficult the tests were, and (c) what other teachers con- 
sidered to be “usual” standards for grading. 


A Realistic Outlook on Evaluation 


In discussing how tests should be composed, scored, and evaluated, 
it is easy to be carried away into assuming that all teachers are highly 
expert in these matters and that they can devote full time to them. 
Because these assumptions are obviously incorrect, it is proper for us 
to realistically analyze the place and scope of testing and evaluation 
in everyday classroom instruction. 

In a sense it is easy to take one’s own evaluations of students too 
seriously. Our tests are not perfect; our gradings of them are not
infallible; we combine grades from separate tests and reports in ways
that are not without some elements of subjective bias; and during the 
short periods in which we instruct students, they are growing and 
changing. 

Of course, in one sense, teachers should be very serious about evaluations.
They should construct the best tests that they can, grade them
as fairly as possible, and try to report in such a way as to do the most
good for everyone concerned. However, teachers should be modest
and frank enough to realize that, even with the best efforts, evaluations
provide only a somewhat blurred image of something (the student)
which is growing and changing.

Teachers, students, and parents should learn to take the results from 
one test, and even the final grades from a whole term, with a large 
grain of salt. Such grades should be considered as only highly tenta- 
tive indications of the student’s basic abilities, his application to school- 
work, and his attitudes toward learning. Bad grades during one term 
may correctly spell trouble for the future; or they may equally well 
mean that the teacher was biased in grading the student, that the 
tests were poorly constructed, that the teacher has unreasonably tough 
standards for grading, or that the student is going through a “phase”
which he will outgrow later.

Considering the potential weaknesses of all forms of educational 
evaluation, there are two principles that teachers should hold firmly
in mind. First, it is relatively safe to make “strong” evaluations, and, 
if necessary, communicate these to parents and others, for only the ex- 
tremes. One can feel relatively sure of the evaluation placed on stu- 
dents who always get A’s. At the other extreme, the student who con- 
stantly fails in some, or most, of the school topics is obviously in real 
trouble. For both extremes it is important to give special help to stu- 
dents, often as much or more so for the very bright as for the very 
dull, or very maladjusted, child. When evaluations are constantly ex- 
treme, it is safe to take the results seriously in managing day-to-day 
classroom activities, in reporting progress to the school, and in dis- 
cussing the student's progress with parents and others. For the 90 per 
cent of the children who do not fall into the extreme categories, it is 
wise to go very slowly in making “strong” evaluations to parents and
others, or in taking “strong” actions based on those evaluations. 

Second, until abilities become more crystallized in later years of 
high school, and until an accumulated record of evaluations is avail- 
able, it is wise to maintain a posture of “wait and see” with nearly all 
students. Time is on the side of more accurate evaluation; and since in 
the early elementary grades there usually is no rush to make important 
decisions about students, it is wise to wait and see what time will 
show.

Unless students in the elementary grades are obviously extreme, it
is better not to arouse either undue pessimism or excessive pride in
students and parents. For the great bulk of in-between students it is
best, until considerable experience indicates otherwise, to say to
ourselves “this is a normal, healthy child apparently doing satisfactory
work in school.”


Suggested Additional Readings 


Strang, Ruth. How to report pupil progress. Chicago: Science Research, 1955.

Thomas, R. M. Judging student progress. (Rev. ed.) New York: Longmans, 1960.

Thorndike, R. L., and Hagen, Elizabeth. Measurement and evaluation in psychology
and education. (2nd ed.) New York: Wiley, 1961, chap. 17.

Wood, Dorothy A. Test construction. Columbus, Ohio: Merrill, 1960, chap. 8.


III 


Standardized 
Achievement Tests 


In terms of sheer numbers, standardized achievement 
tests outrank all other commercially distributed edu- 
cational and psychological tests. In terms of impact 
on the classroom, standardized achievement tests are 
second in importance only to tests which the teacher 
himself develops for use with his own students. 

The word “standardized” means that the tests are 
carefully constructed by experts for use throughout 
the country. Directions for administration, scoring, 
and interpretation are made explicit; and norms are 
provided which make possible direct comparisons be- 
tween students of different ages and between students 
in different schools and localities. Some of the pur- 
poses, principles, and methods for constructing and 
using achievement tests will be discussed in Chapter 8. 
One primary distinction among achievement tests is 
that between comprehensive tests and tests for indi- 
vidual topics. Comprehensive tests contain subtests for 
all, or most, of the important topics taught at particu- 
lar grade levels. For example, a comprehensive test 
for the sixth-grade level typically would contain sub- 
tests for reading, language, arithmetic, science, social 
studies, and others. Individual achievement tests are
aimed only at parts of the total subject matter, such
as reading only, geography only, or chemistry only.
Another distinction among achievement tests is
that between survey tests and diagnostic tests. Survey
tests are intended to measure how much students
know; diagnostic tests are intended to “look inside”
the student's performance in such a way as to provide
clues about work habits and particular faults. As will 
be discussed more fully later, the distinction is more a 
matter of degree than of kind. The diagnostic meas- 
ures are all tests of separate topics rather than com- 
prehensive batteries. They are restricted almost exclu- 
sively to reading and arithmetic. Comprehensive 
achievement tests will be discussed in Chapter 9; sur- 
vey and diagnostic measures for separate topics will 
be discussed in Chapter 10. 


chapter 8


Construction and Use of Standardized Achievement Tests


Miss Martin is near the end of her first year of teaching, and she is
rather proud of the progress shown by her fourth-grade children. Now,
early in May, it is time for her students to take a comprehensive
achievement test. She has heard a lot about the tests from other 
teachers and has learned that the tests are considered quite important 
in the school. She hopes that her students will perform well and that 
the tests will confirm her judgments about the relative standings of her 
students with respect to one another. “Will the tests show that my 
students as a group have done well in the fourth grade? Will they 
confirm my opinion that Bill Harris is an outstanding student and that 
Anne Blackman should be held back in the fourth grade? Will the new 
approach that I used in the teaching of number skills ‘pay off’ on the
tests?” 

Purpose of Achievement Tests. The purpose of achievement tests is 
to measure progress in school up to a particular point in time. The 
purpose of a comprehensive battery given at the end of the fourth
grade is to determine how well students have mastered school topics 
such as reading, number skills, spelling, grammatical usage, and social 
studies. Standardized achievement tests are natural outgrowths of 
teacher-made tests. They share the same standards of validity, and they 
are both aimed directly at school-learned information and skills. The 
most important differences between standardized achievement tests 
and teacher-made tests are as follows: 

1. Coverage. Standardized achievement tests usually cover much
more material than that in most teacher-made tests. Typically, a
teacher-made test is intended to measure progress in learning number 
skills for a one-month period or progress in ancient history over a 
school semester. For example, whereas an achievement test would have 
a section dealing with the Revolutionary War in general, a teacher- 
made test might concern much more highly specialized aspects of the
war, such as the activities of Indian tribes in a particular locality. Com- 
prehensive achievement batteries measure progress in all, or most, 
important topics up to a particular point in time. The teacher-made 
test often provides more detailed information about how well the 
student has mastered a particular topic; the achievement test provides 
a picture of the student's over-all educational development. 

2. Objectives. In addition to having a broader content coverage, 
the kinds of content in achievement tests tend to differ somewhat from
those in the teacher-made test. The teacher-made test is specifically
aimed at local objectives, those of the teacher and the school as a
whole. Achievement tests are based on the core educational objectives
shared by educators across the country. Consequently, some materials
on the achievement test may not be considered very important by a 
particular teacher, and some of the types of material that he considers 
important may be given little, if any, coverage in the achievement test. 
Rather than this imposing a conflict of objectives, if properly under- 
stood the partial difference in objectives means that achievement tests 
provide a valuable supplementary source of information to teachers 
and whole schools. 

3. Construction. Perhaps the most important difference between 
achievement tests and teacher-made tests is in terms of the relative care 
and expense given to the two types of tests. It would be foolish not to
frankly recognize that few teachers are truly expert in matters relating
to educational measurement and that they have only a limited amount
of time to construct and use tests. In contrast, commercially distributed
achievement tests are constructed by experts, who may take as long as 
several years to develop a new test. Because of the resources available, 
very careful plans can be made for the content of the tests, much time 
can be devoted to the construction of items, and elaborate empirical 
investigations can be made of the quality of individual items and whole 
tests. Teachers should not be so foolishly proud as to ignore the fact 
that, item per item, standardized achievement tests usually are much 
more carefully constructed than are teacher-made tests. 

4. Norms. One of the major advantages of achievement tests is that 
they provide norms for comparing individual students, classes, and 
whole schools with the school progress shown by students in schools 
across the country and with students in individual states and geo- 
graphical regions. Such norms are obtained as part of standardizing the 
tests. Teachers seldom have such norms available for their own tests. 
Typically, they test only twenty or thirty students, and the most they 
can do is to compare students with one another or with the scores made 
by students in previous years on similar tests. Norms are useful in 
making many comparisons, such as comparing the mathematical 
achievement of students in all the high schools within a city, comparing 
the level of reading ability of children in one school system with that
in the country as a whole, and comparing the levels of attainment in 
number skills for a student at the ends of the fourth and fifth grades. 
It is in helping to make such comparisons that standardized achieve- 
ment tests have their particular advantage over teacher-made tests. 

5. Uses. Because of differences in the scope, content, construction, 
and normative data, teacher-made tests and standardized achievement 
tests are intended to supply somewhat different kinds of information 
for use in making educational decisions. It is unfortunately the case 
that some teachers regard standardized tests as “competitors” to their 
own tests and informal observations about pupils. Actually, the two 
types of measures are intended to serve largely different functions. 
There are some areas of overlap, e.g., making decisions about the
promotion of students, but these are small compared with the relatively 
unique advantages of each. The particular uses of achievement tests 
will be spelled out in detail later in the chapter. 

Validity of Achievement Tests. In Chapter 2, three kinds of validity
were described, depending on the functions which particular instruments
are intended to serve: prediction, assessment, and trait measurement.
Standardized achievement tests are primary examples of assessments.
Their validity depends on exactly the same standards as do
teacher-made tests. Assessments are valid if their content is representative
of a particular unit of instruction or an over-all course of training.
The validity of predictors and trait measures rests largely on empirical
and statistical procedures, the former on the correlation of a test or test
battery with a criterion, and the latter on a complex of correlations
with numerous measures. To study these two types of validity, experiments
must be undertaken and statistical analyses made of the results.

As was mentioned previously, some empirical and statistical procedures
are helpful in validating achievement tests, but those are not the
primary standards. It is inevitably the case that the validity of achievement
tests and teacher-made tests rests on “rational” considerations.
Only by the exercise of expert judgment can it be told whether or not
an achievement test faithfully adheres to the goals of instruction. Because
of the availability of experts and other resources, it usually is
possible to guarantee a higher level of content representativeness in
standardized achievement tests than in teacher-made tests. This is
accomplished by carefully planning the content of standardized achievement
tests in conjunction with a representative cross section of educators,
composing items of a kind that are most likely to measure
important aspects of learning, and by systematically gathering opinions
from teachers about the quality of the tests. How these steps are undertaken
will be described in a later section of this chapter.

Aptitude and Achievement. An issue that has been raised pre- 
viously in this book, and must be raised again later, is that of the 
difference between aptitude and achievement. Ideally, aptitude is the
capacity to learn, a forecast of how much students can achieve under 
favorable conditions. Achievement is how much students have learned 
up to a particular point in time. Another way of saying it is that apti- 
tude tests are meant to be predictive of future achievement, and 
achievement tests are meant to assess the actual level of attainment. 

It is easy to see that presently we have no way of measuring aptitude 
entirely apart from past achievement, and, indeed, the two are logically 
semi-inseparable. We have no way of “looking inside” people to gauge 
their intelligence and special aptitudes. Rather, we must judge apti- 
tude by how well people have mastered their cultural environments 
up to particular points in time. Consequently, many of the items on 
aptitude tests, e.g., tests of intelligence, are similar to items on tests of 
achievement. For example, vocabulary tests usually are present in both 
types of measures. Also, however one conceives of aptitude, it is
reasonable to believe that it is influenced by the richness of past school- 
ing and other experiences. If, in some sense, two children are born 
with the same level of aptitude, their aptitudes should be changed by 
what occurs during the years of maturation. Most properly, aptitude 
should be thought of as aptitude for the next step in education rather 
than as some innate and unchanging characteristic of the student. 

In spite of the theoretical and practical difficulties of measuring apti- 
tude somewhat apart from achievement, the distinction is too well 
rooted in common sense to be done away with completely. When Bill 
is taunted by one of his friends about his dog's inability to perform 
tricks, Bill says “He ain't dumb; he just ain't had no education.” At a
simple level this is recognition of the difference between what an 
organism might accomplish under favorable conditions and what it has 
accomplished. The same type of distinction is made by teachers every
day. We hear this when a teacher says, “If he wanted to he could
make all A’s.” It is heard in another form when a teacher says, “He 
actually has less ability than one would judge from his grades; he is 
knocking himself out to keep up with the class”; and it is heard in still 
another form when a teacher says, “He is not failing because of lack of 
ability but because of his disturbed personality.” 

The distinction between aptitude and achievement is too important 
and too well founded in practical experience to be ignored. Admittedly,
teachers are prone to overemphasize the distinction between the two. 
Seldom does the student with high aptitude make poor grades even if 
he is under adverse conditions in the classroom and at home. Con- 
versely, seldom does the “slow learner” begin to make excellent grades 
after even the best of individual attention. However, moderate, and 
sometimes even marked, differences are found between measures of 
aptitude and achievement; and when they are, they are of real diagnostic
importance. For this reason, it is important to do the best job
possible of measuring aptitude somewhat apart from achievement. In
this part of the book will be discussed principles relating to the
measurement of achievement; in Part IV of the book will be discussed
principles relating to the measurement of aptitudes.


Construction of Achievement Tests 


The principles and methods for constructing standardized achievement
tests are essentially the same as those described in Part II for
the construction of teacher-made tests. First, the purposes of the
instrument must be clearly decided. Second, a detailed outline is made
of the content to be included. The statement of purposes and the outline
of content are discussed with educational experts and with classroom
teachers. Such discussions usually result in clarifications and
changes in both the statement of purposes and the outline of content.
Next, items are composed for each part of the outline of content, and
these are inspected by numerous persons for their clarity, representativeness,
and importance. In this process, some items are discarded,
some modified, and others added. The items are then administered to
large numbers (usually more than a thousand) of students. Item
analyses are then made along the lines discussed in Chapter 7. The
items that meet the necessary statistical requirements are used to form
the actual test. The test is then administered to thousands of children
across the country who form a representative sample of all those
students with whom the instrument will be used, e.g., a sample of
fourth-grade children. Statistical analyses are then made to obtain
norms for the country as a whole and for separate geographical
regions, states, and local school systems. The norms usually are reported
in several different forms, including percentiles, standard scores,
transformed standard scores, grade equivalents, age equivalents, and
others. Finally, after all these things have been done, manuals must
be carefully written for the administration, scoring, and interpretation
of the test. The test is then ready to be placed “on the market.” From
the foregoing, it is obvious that the construction of a new standardized
measure of achievement requires a great deal of time, money, research,
and expert attention.

Outline of Content. The outlining of content for achievement tests
does not differ in principle from that used by the teacher in composing
his own examinations. However, in practice there are several features
that usually distinguish the two. First, the outline of content for
achievement tests is usually much more carefully constructed than that
for teacher-made tests. Second, because the coverage is broader in
achievement tests, the outline must also be more extensive. Third,
whereas the outline used by the individual teacher for his examinations
is usually determined by himself alone, the outline for an achievement
test must be cooperatively derived by educational experts, educational
administrators, and teachers. The following outline of content for the
reading and arithmetic sections of the California Achievement Tests,
Elementary Battery (78) will illustrate the care and detail required of
such outlines. (Typical items employed in achievement tests will be
shown in the next chapter.)


Reading

  I. Reading comprehension
     A. Word form
        1. Lower case words
        2. Capitals
        3. Miscellaneous type faces
     B. Word recognition
        1. Gross differences
        2. Initial sounds or endings
     C. Basic vocabulary
        1. Opposites
        2. Similarities
 II. Reading comprehension
     A. Following specific directions
        1. Simple directions
        2. Directions requiring simple choice
        3. Reading definitions and following directions
     B. Interpretation of meanings
        1. Selecting topic or central idea
        2. Understanding directly related facts
        3. Making inferences
        4. Comprehension of organization of topic
        5. Sequence of events
III. Reference skills
     A. Parts of book
     B. Alphabetizing
     C. Use of tables of contents
     D. Use of index


Arithmetic

  I. Arithmetic reasoning
     A. Number concepts
        1. Writing numbers
        2. Writing money
        3. Roman numerals
        4. Concept of whole numbers
        5. Concept of fractions, decimals, and per cent
     B. Signs and symbols
        1. Signs
        2. Abbreviations
     C. Problems
        1. One step
        2. Two step
        3. Sharing and averaging
        4. Square measure and cubic content
        5. Percentage
        6. Ratio
 II. Arithmetic fundamentals
     A. Addition
        1. Simple combinations
        2. Bridging
        3. Carrying
        4. Zeros
        5. Column addition
        6. Adding money
        7. Adding numerals
        8. Reducing fractions to common denominators
        9. Adding mixed numbers
       10. Adding fractions and decimals
       11. Writing decimals in columns
       12. Denominate numbers
     B. Subtraction
        1. Simple combinations
        2. Borrowing
        3. Zeros
        4. Subtracting money
        5. Subtracting numerators
        6. Reducing fractions to common denominators
        7. Borrowing with mixed numbers
        8. Subtracting fractions from decimals
        9. Writing decimals in columns
       10. Denominate numbers
     C. Multiplication
        1. Tables
        2. Zeros in multiplicand
        3. Zeros in multiplier
        4. Two-place multipliers
        5. Multipliers with fractions
        6. Cancellation of fractions
        7. Fractions and mixed or whole numbers
        8. Pointing off decimals
        9. Denominate numbers
     D. Division
        1. Tables
        2. Zeros in quotient
        3. Remainders
        4. Inverting divisors in fractions
        5. Mixed numbers
        6. Reducing fractions to decimals
        7. Pointing off decimals


A Case History of Achievement Test Construction. After the out- 
line is completed come the many labors of manufacturing and stand- 
ardizing the completed instrument. To illustrate how this is done, 
selected parts will be quoted¹ from the Directions for Administration,
Metropolitan Achievement Tests for grades 3 and 4 (23). 


Curriculum research. The Metropolitan series attempts to measure 
those outcomes of instruction which, according to authoritative judgment 
and consensus of current practice, are the important goals of present ele- 
mentary instruction. To ascertain what these goals or outcomes are, sub- 
ject by subject and grade by grade, the authors reviewed expert pro- 
nouncements concerning the goals of elementary education, current re- 
search on the nature of essential skills, such as reading and the work- 
study skills, representative courses of study, and several widely used text- 
book series in the various branches. From these sources, they developed 
a detailed outline or blueprint for each test at each level, specifying the 
objective and, where appropriate, the content areas or topics to be 
covered and indicating the proportionate emphasis to be devoted to each 
objective or outcome as well as the desired distribution of the test con- 
tent among various areas. 

Experimental editions. After the specifications had been formulated, 
test items were prepared, edited, and in many cases reviewed by one or 
more subject-matter specialists. Considerable research was undertaken 
on matters of item type, appropriateness of directions, time limits, and
related issues. Whenever there was any question about the appropriate- 
ness of a proposed new type of item, its suitability was experimentally 
verified before its adoption in the experimental forms. A total of ap- 
proximately fourteen thousand items were developed, 30 to 40 per cent 
more material than was ultimately to be used in the final forms.

Tryout programs. The experimental forms were administered in a
series of carefully planned programs to pupils in nine school systems.
These systems constituted a varied group of communities, widely
divergent with respect to type of pupil population, type of community,
textbooks in use, and other characteristics presumably related to
achievement status.

¹ The quoted material is presented by permission of Harcourt, Brace & World,
Inc. Some sections of the quoted material have been omitted and others have
been slightly altered in order to conform to the style and purpose of this book.

Item analysis. The per cent of pupils answering correctly was computed 
for every item, separately for boys and girls, for each grade in which the 
item was administered. For most subtests, an item discrimination index 
was also computed, indicating how effectively the item distinguished 
between pupils scoring high and low. 

On the basis of the actual performance of the items in the tryout, final 
forms of the various tests were developed that are of appropriate dif- 
ficulty, both with respect to average difficulty and range of difficulty, and 
that discriminate as effectively as possible both among pupils in successive
grades and among pupils of varying ability within each grade. The
selection of items was carried on so as to produce equivalent final forms, 
each conforming to the original specifications established for the test. 

Teacher evaluation. All teachers who administered the experimental 
forms of the test were asked to comment on, and criticize, these forms, 
particularly with respect to appropriateness of the content for grades in 
which administered, clarity of directions, clarity of the test questions
themselves, and general pupil and teacher reaction.
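
The item-analysis step described in the quoted material can be illustrated with a small sketch. The quoted manual does not say which discrimination index was computed; the upper-lower index used here (proportion correct among the highest-scoring pupils minus the proportion correct among the lowest-scoring pupils) is one common choice and is an assumption, as are the function names, the 27 per cent group size, and the sample data.

```python
def per_cent_correct(item_responses):
    """Per cent of pupils answering the item correctly (1 = right, 0 = wrong)."""
    return 100.0 * sum(item_responses) / len(item_responses)


def discrimination_index(item_responses, total_scores, fraction=0.27):
    """Upper-lower index: proportion correct in the top-scoring group
    minus proportion correct in the bottom-scoring group.

    `fraction` (hypothetical default of 27 per cent) sets the size of
    each extreme group.
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))
    # Pupil positions ordered from lowest to highest total test score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:k]
    high_group = order[-k:]
    p_high = sum(item_responses[i] for i in high_group) / k
    p_low = sum(item_responses[i] for i in low_group) / k
    return p_high - p_low


# Hypothetical tryout data for one item and ten pupils.
item = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]
totals = [48, 45, 41, 39, 36, 30, 28, 22, 19, 15]
difficulty = per_cent_correct(item)
discrimination = discrimination_index(item, totals)
```

An item with a per cent correct near the middle of the range and a clearly positive index is the kind that would be retained for a final form; an index near zero or below would mark the item for revision or removal.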


In addition to the steps described in the quoted material above, 
numerous other labors were required to produce the final battery of 
achievement tests. The final forms were administered to over five 
hundred thousand pupils for the purposes of obtaining norms. These 
included students from 225 school systems in 49 states. For each grade 
level, raw scores were transformed to grade norms, percentiles, stand- 
ard scores, and transformed standard scores. Finally, investigations 
were made of test reliability. Split-half reliability correlations were
separately computed for each grade level in each of several school
systems. Reliabilities were found to be generally good, averaging about
.90. After the study of reliability was completed, tests and test manuals
were prepared for publication. Since their publication in 1959, The
Metropolitan Achievement Tests have been widely used. Some other
good achievement tests are described in Chapters 9 and 10, and a
more extensive list is given in Appendix D.
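
As a rough illustration of what a split-half reliability study involves, the sketch below splits a test into odd-numbered and even-numbered items, correlates the two half scores, and then applies the Spearman-Brown formula to estimate the reliability of the full-length test. The odd-even split, the Spearman-Brown step, the function names, and the sample data are assumptions for illustration; the text does not describe how the publisher formed the halves.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Estimate full-length reliability from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_matrix):
    """item_matrix: one row per pupil, one 0/1 entry per item.

    Returns (correlation between halves, Spearman-Brown estimate).
    """
    odd_scores = [sum(row[0::2]) for row in item_matrix]   # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in item_matrix]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_scores, even_scores)
    return r_half, spearman_brown(r_half)


# Hypothetical responses of five pupils to a six-item test.
responses = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
r_half, r_full = split_half_reliability(responses)
```

Because each half is only half as long as the whole test, the correlation between halves understates the reliability of the full test; the Spearman-Brown step corrects for this.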


Major Uses of Achievement Tests 

Standardized achievement tests provide such a wealth of information
about students that it would be impractical to describe all their uses.
Following are descriptions of some of the major types of educational
decisions that are aided by the use of achievement tests.

Grade and Section Placement. Whenever there are questions about
the assignment of students to particular grade levels, achievement
tests provide very useful information. Achievement tests are particularly
helpful in deciding on the grade level for transfer students.
The student coming from another school in another locality might have 
been fed a very different kind of educational fare from what he will 
receive in his new school. The over-all ability levels of students in the 
two schools might differ greatly, and the emphases in instruction might 
be very different. The grades that students bring with them are of 
some help in assigning transfer students to grade levels, but they leave 
many questions unanswered. Although the student may have received 
instruction in social studies, do the two schools mean the same thing 
by “social studies”? Although the student may have received passing 
grades in his previous school, does the previous school have different 
standards than the new school? Standardized achievement tests help 
answer such questions. 

Regardless of where the student previously attended school, he prob- 
ably will have the results of achievement tests on his records. These 
scores can be interpreted directly in the new school setting. Even if 
the previous school employed a different achievement test battery 
from that used in the new school, the needed comparisons can be 
made. If the tests administered in the previous school show that the 
student stands at the 55th percentile in reading achievement com- 
pared with national norms, this can be compared directly with the 
average reading score obtained by students at the new school. This 
way it can be told whether or not a student is likely to perform satis- 
factorily at a particular grade level. 
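
The comparison just described can be sketched numerically. The example below converts national percentile ranks to standard (z) scores and reports how far a transfer student stands above or below the average of the new school. It assumes that scores are approximately normally distributed in the norm group and that the new school's average has also been expressed as a national percentile; the function names and the figure of 70 for the school are hypothetical.

```python
from statistics import NormalDist


def percentile_to_z(percentile_rank):
    """Convert a percentile rank (between 0 and 100, exclusive) to a
    z score, assuming a normal distribution in the norm group."""
    return NormalDist().inv_cdf(percentile_rank / 100.0)


def standing_relative_to_school(student_percentile, school_mean_percentile):
    """Difference, in national standard-deviation units, between the
    student and the average student at the new school."""
    return percentile_to_z(student_percentile) - percentile_to_z(school_mean_percentile)


# Student from the text at the 55th national percentile; the new
# school's average (70th percentile) is a made-up figure.
gap = standing_relative_to_school(55, 70)
```

A negative result indicates that the student stands below the average of the new school even though he is above the national median, which is the kind of information needed in judging grade and section placement.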

Another important use of achievement tests is in grouping children 
according to ability levels. Of course there is considerable controversy 
about whether or not ability grouping should be done; but if it is 
done, standardized achievement tests are helpful in making the nec- 
essary decisions. For grouping students in the first three grades, in- 
telligence tests are somewhat better than achievement tests. This is 
because (a) at that level school topics are not sufficiently broad to tap
the student's potential for later schoolwork, (b) successful performance
in the first several grades is dependent to some extent on “incidental”
abilities, such as is evidenced in the rote memorization of multiplication
tables, (c) some children do not “settle down” in their schoolwork
until they have passed the primary grades, and (d) achievement in 
the first several grades is dependent to some extent on whether or not 
students attended kindergarten and on whether or not parents have 
undertaken some preschool “coaching” of their children. Although in- 
telligence tests do not entirely circumvent these sources of error in 
forecasting the long-range accomplishments of students in school 
topics, at the primary level they are somewhat broader and less de- 
pendent on incidental factors than are achievement tests. 

Beginning with about the fourth grade, achievement tests start to
surpass intelligence tests in terms of importance in making ability
groupings. After that point, intelligence tests still provide useful sup- 
plementary information, but standardized achievement tests become 
the mainstays for making decisions about ability groupings. Also, at 
every level, course grades and opinions of teachers are important 
sources of information for use in assigning pupils to their proper 
sections. 

Grades and Promotion. Grades usually are based mainly on 
teacher-made tests rather than on achievement tests. Partly this is a 
matter of necessity, because comprehensive achievement tests are 
given only once a year, or in many school systems only several times 
during elementary school. Aside from questions of necessity, teacher- 
made tests generally are preferable for the assignment of term grades 
and periodic grades during the school year. This is because the individ- 
ual school and the individual teacher have the right to judge the 
progress of their students. No matter how comprehensive the achieve-
ment test is intended to be, it cannot measure all the important mate- 
rial covered at any grade level. Also, the aims of instruction vary some- 
what from school to school, and achievement tests cannot be used as 
infallible guides to how well students are meeting local objectives. 

Although achievement tests seldom are directly involved in the assignment
of grades, they often do play a part in making decisions
about promotion. Achievement tests supplement the teacher’s own 
tests, and they provide a means of comparing a student’s perform- 
ance with students in general. Regardless of how hard teachers try to 
avoid being influenced by such considerations, decisions to either promote
or not promote students are based to some extent on personal
prejudices. Of course there is no possibility for such personal prejudices
to affect the results of achievement tests. Also, in some instances,
decisions about promotion hinge on tests composed by teachers who 
habitually compose very poor tests. Regardless of whether or not the 
achievement test contains all the important material, or whether or 
not it is maximally slanted toward local goals, it usually is well constructed.
For these reasons, in making decisions about promotion, it is
wise for schools to use standardized achievement tests to supplement
the results from teacher-made tests. 

Evaluation of Teachers. Some teachers are unfriendly to achievement
tests because they view them as tests of teaching ability. For example,
if it is found that students do poorly on an achievement test in
biology compared with scores obtained by students in neighboring
schools, it suggests that the biology teacher is not doing a very good
job. For a number of reasons it is doubtful that school administrators 
should systematically use the results of achievement tests to evaluate 
individual teachers. 


Obviously, the threat of “exposure” can be quite frightening to
teachers and can lead to poor educational practices. If a high premium
is placed on the results of achievement tests, teachers are prone to 
emphasize in their instruction those skills and kinds of knowledge 
measured by the tests. This will prevent teachers from spending time 
on matters not directly represented in the tests, reduce the teacher's 
interest in promoting character and good attitudes toward learning, 
and make teachers reluctant to experiment with promising new ap- 
proaches to instruction. 

Even if it were not for the poor attitudes engendered by the prac- 
tice, the use of achievement tests to evaluate teachers often is on 
shaky ground. How well a class performs is directly related to the ap- 
titude of the students. In some schools most of the students score 
above average in comparison to national norms. In other schools most 
of the students score below average. Obviously, the teacher cannot be 
given the credit, or suffer the blame, for the initial levels of aptitude 
of his students. 

Achievement tests measure not only how much students have learned 
in a particular school year but also how much they have learned in all 
their prior schooling. Consequently, teachers are not solely respon- 
sible for the scores their students make in any one year. For example, 
if the third-grade teacher gives very poor instruction in number skills, 
this will carry over to the fourth grade and be evidenced to some ex- 
tent on an achievement test administered at the end of the year. 

After the factors discussed above have been considered, it cannot be 
denied that the individual teacher plays an important part in deter- 
mining how well his students do on achievement tests. Although it is 
not recommended that achievement tests generally be used by school 
administrators to evaluate teachers, it is very helpful for teachers to 
know how well their students perform. It is best when teachers are 
furnished with information on their students by some “neutral” person 
or agency rather than by school principals and other administrative 
officials in the school system. This way, teachers can use the results 
from achievement tests but not be threatened by them and not be 
made to tailor instruction entirely to what the tests measure. For ex- 
ample, it would help a teacher to know that his students perform 
relatively much better in reading than in grammatical usage. The 
teacher may have purposefully deemphasized the learning of gram- 
matical usage and, perhaps for good reason, decide to continue that 
practice in the future. However, teachers should be fully aware of 
their relative emphases, and either continue present practices or make 
changes on a rational rather than a hit-and-miss basis. If the threat is 
removed, achievement test results provide teachers with valuable in- 
formation for judging their over-all and differential effects on students 
and for planning their instruction with future classes. 




Guidance. It should be obvious that the results of standardized 
achievement tests are very helpful in dealing with students who have 
problems of one kind or another. In conjunction with course grades, 
achievement tests provide a measure of how well students have pro- 
gressed in their schoolwork up to a particular point in time. Let us 
look at two problems in which the results of achievement tests would 
be helpful. 

Scott Kendall is in trouble in the fourth grade. He is overactive and 
undisciplined. His teacher feels that Scott is actually learning very
little, and she recommends to the guidance counselor that Scott be 
given remedial instruction. One of the first things the guidance coun- 
selor does is to look at Scott’s scores obtained on comprehensive 
achievement tests in the first three grades. Surprisingly enough he is 
well above the average of his classmates in all topics. Discussions 
with Scott’s previous teachers show that he has always been hard to 
handle. However, Scott does not need remedial instruction. Even 
though he does not cooperate, and apparently does not study, he is 
managing to do rather well in school topics. The guidance counselor 
decides that, rather than provide remedial instruction, discussions will 
be held with Scott's parents, in which suggestions will be made about 
providing some controls over Scott's obstreperous conduct.

Bill is a high school student in Minneapolis. He graduates at the 
end of the school year, and he is considering applying for admission 
to the University of Minnesota. He is concerned because his high 
school grades are only average, and he wonders whether or not he has 
what it takes to perform well in college. What Bill does not fully 
understand is that he is in a much above average high school, in a 
very well-to-do neighborhood. The high school counselor shows Bill
that, in comparison with students across the country, he stands at the
90th percentile on a previously administered achievement test. The
counselor makes an even more important comparison for Bill. He shows
Bill a table in which the grade averages of students at the University
of Minnesota are compared with the achievement test scores which
they earlier made in high school. By this comparison it is easy for Bill
to see that he has the potential for performing very well at the
University. Obviously, if the achievement test scores had not been
available and Bill had made his decision on the basis of grades alone, he
might have been discouraged from going to the University.

Planning Daily Instruction. If standardized achievement tests
served no other purpose, they would still be of value to the individual
teacher in planning the day-to-day instruction of his students. Each
year, teachers receive a new class in which each student is largely an
unknown. How well have the students mastered their previous training?
As a group, what are their strong points and their weak points?
Which students need special attention? If comprehensive achievement
tests were administered late in the previous school year or at the be- 
ginning of the new year, the results will go a long way toward provid- 
ing the teacher with the information he needs to deal effectively with 
the class as a whole and with individual students. Without this infor- 
mation, teachers would have to spend much of the time during the 
first month or two of school trying to size up the previous accomplish- 
ments and capabilities of their new students. By then, much valuable 
time would have been lost, the class as a whole might have been set on 
improper courses of study, and the problems of particular students 
would not have been fully understood. 

Following are several examples of how results of achievement tests 
help in planning day-to-day activities in the classroom. For some time 
the teacher noted that Jack Whitmore apparently had some difficulties 
in reading, but it was difficult to tell exactly what the problem was. 
Some clues were obtained on a special achievement test for reading 
skills. Jack performed well in the section of the test concerning word 
knowledge, and average in the section on reading comprehension; but 
in the section on rate of reading it was found that he read very slowly. 
During the next week the teacher could see that Jack was very timid 
about reading aloud in front of other students and that he spent much 
time in looking up from his page to see how fellow students were re- 
acting. Perhaps timidity about “performing” in front of others had 
generated bad habits in reading. For the next two months, the teacher
did not ask Jack to read aloud. Instead he was asked to read silently,
then tell the teacher what he read. Also, the teacher made special
efforts to be understanding of Jack's shyness. With these and other
remedial practices, Jack's reading speed gradually increased to the
normal rate. 

Mrs. Long is teaching the fourth grade in a new consolidated ele- 
mentary school that draws students from districts that each formerly 
had small schools. Mrs. Long worries that her new students will differ 
markedly from one another in terms of the quality of their previous 
training. Inspection of achievement test results confirms her suspicions.
Most of the children apparently are up to grade level in achievement, 
but six students who all came from the same small school are perform- 
ing far below grade level. Mrs. Long decides that for the first several 
months she will start those six children on third-grade level readings 
and exercises and then see if in time they can catch up with the class 
as a whole. 

Wiley School is very proud of its efforts to introduce science topics 
into the curriculum. Intensive treatment of science topics is planned
for the sixth and seventh grades. The first year that the new program 
is tried, teachers are shocked to find that their sixth-grade students
performed only average on the science section of a comprehensive
achievement test. However, after looking carefully at the test items 
and at the responses of their students to the test, the teachers could 
see the problem. The “science” section of the test contained a broad 
collection of items on biology, physics, astronomy, geology, and others. 
In the curriculum it had been planned to emphasize the life sciences 
in the sixth grade and the physical sciences in the seventh grade. Con- 
sequently, teachers decided not to worry about the apparent “aver- 
age” performance and to wait and see how well the students per- 
formed after they had been introduced to physical science in the 
seventh grade. By the end of the seventh grade, scores of students on 
the science section of the achievement test went up markedly, which 
reinforced the school in maintaining its new approach to teaching 
science. 

Research. Much educational research would be all but impossible 
to conduct without the availability of standardized achievement tests. 
This is so both for the informal research that whole schools and indi- 
vidual teachers conduct on their educational practices and for the 
more formal research undertaken by psychologists and educational
research specialists. Two examples will suffice to illustrate such uses. 

In a particular school district, parents are alarmed by the apparently 
poor caliber of the schools. A movement is started to provide more
money for the schools, to obtain better school administrators, and to 
upgrade the caliber of the teachers. Numerous steps are taken to improve
the schools. Several years later everyone wonders whether or
not the measures have worked. How can the question be answered? 
At first thought, one might study the grades of students before and 
after the change, but it would be misleading to do that. Because of 
tightening up of standards, many individual students, and students on 
the average, may be receiving poorer grades after the change than before.
Only by the use of standardized achievement tests can it be told
whether or not the changes have worked. If, in comparison to national 
norms, students as a group stand higher after the changes than before, 
there is good reason to believe that the changes have been beneficial. 

In a more formal investigation, psychologists are studying the effec- 
tiveness of a new approach to teaching mathematics to fifth-grade 
children. The new approach minimizes rote memorization of mathematical
procedures and, instead, relies heavily on imparting some of
the central “ideas” required to understand mathematics. The new approach
is tried in ten schools. Some sections of the fifth grade in each
school use the new method, and others employ traditional methods.
Now it must be determined whether or not the new method is actually
an improvement. This could not be determined by comparing the 
teacher-made tests and term grades in the two types of instruction,
because the two types of instruction have been concerned with largely
different material. The only sensible way to make the comparison is 
with a standardized achievement test, either one that is regularly em- 
ployed in schools or one that is specifically designed for the research 
project. 
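
A comparison of this kind might be summarized as in the sketch below, which contrasts the average standardized-test scores of sections taught by the new method with those taught by traditional methods. The text does not name a statistic; the raw mean difference and the standardized difference (the difference divided by the pooled standard deviation) are assumptions chosen for illustration, and the scores are invented.

```python
from math import sqrt
from statistics import mean, variance


def mean_difference(new_scores, traditional_scores):
    """Average score under the new method minus the traditional average."""
    return mean(new_scores) - mean(traditional_scores)


def standardized_difference(new_scores, traditional_scores):
    """Mean difference expressed in pooled standard-deviation units."""
    n1, n2 = len(new_scores), len(traditional_scores)
    pooled_variance = (
        (n1 - 1) * variance(new_scores) + (n2 - 1) * variance(traditional_scores)
    ) / (n1 + n2 - 2)
    return mean_difference(new_scores, traditional_scores) / sqrt(pooled_variance)


# Hypothetical arithmetic scores on the same standardized test.
new_method = [52, 55, 58, 61, 64]
traditional = [48, 51, 54, 57, 60]
difference = mean_difference(new_method, traditional)
effect = standardized_difference(new_method, traditional)
```

Because both groups take the same standardized test, the difference can be read on a common scale, which is not possible when each group is graded on its own teacher-made tests. A real study would also need enough sections and a significance test before any conclusion is drawn.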


Summary 


The use of standardized achievement tests has become so common- 
place in elementary and secondary schools that we seldom stop to con- 
sider what their properties are and why they are used. Standardized 
achievement tests share a number of properties with teacher-made 
tests. Principally, both are concerned with school-learned concepts 
and skills, and both depend on content representativeness for validity. 
Standardized achievement tests tend to differ from most teacher-made 
tests in that (a) their content coverage is broader and less “deep” with 
respect to some aspects of subject matter, (b) they are intended to 
measure over-all progress in school rather than progress only in par- 
ticular units of instruction or only in particular subjects, and (c) they 
usually are much more carefully constructed and standardized than is 
possible for most teacher-made tests. 

The construction of a new achievement test is a large-scale under- 
taking, which may require several years and consume many thousands 
of dollars. Some of the major steps that must be taken are (a) con- 
sultation with teachers and subject-matter specialists to determine 
the goals of the instrument, (b) composition of a detailed outline of 
the content to be included, (c) writing and rewriting of thousands of 
test items, (d) large-scale tryout and statistical analysis of items, (e) 
construction of final forms, (f) administration of tests to thousands of 
students to obtain norms, and, finally, (g) composition of testing 
manuals.

Rather than being competitive with each other, teacher-made tests 
and comprehensive achievement tests have their special advantages in 
shaping educational decisions, and they share in the making of some 
kinds of decisions. Teacher-made tests usually are better for (a) de- 
termining how well students perform in particular units of instruction, 
e.g., trigonometry; (b) determining course grades; (c) structuring
day-to-day activity in the classroom; and (d) determining how well 
local objectives are being met in the instruction. Achievement tests 
have an advantage in making certain types of comparisons, such as 
(a) comparisons of students’ progress from year to year, (b) comparisons
of performance in different schools and in different localities,
(c) comparisons of students’ performance with that of students across 
the country, and (d) the comparisons that are required in educational
research. Teacher-made tests and standardized achievement tests share
in the making of many types of educational decisions, particularly 
those regarding grade placement, promotion, and the handling of any 
type of problem student. 


Suggested Additional Readings 


American Educational Research Association and National Council on Measurements
Used in Education, Committee on Test Standards. Technical recommendations
for achievement tests. Washington: National Education Association,
1955.

Anastasi, Anne. Psychological testing. (2nd ed.) New York: Macmillan, 1961,
chaps. 16 and 17.

Bean, K. L. Construction of educational and personnel tests. New York:
McGraw-Hill, 1953.

Noll, V. H. Introduction to educational measurement. Boston: Houghton Mifflin,
1957. Pp. 90-107.

Wood, Dorothy A. Test construction. Columbus, Ohio: Merrill, 1960.


chapter 


Comprehensive Achievement Tests


Comprehensive achievement tests are the mainstay of school-wide 
testing programs. They provide the best evidence available regarding 
the over-all educational progress of students up to particular points in 
time. Their results are sufficiently valuable that, if the school can 
afford it, comprehensive achievement tests should be administered 
each year to all students.

Comprehensive achievement tests of today are far better than their 
predecessors of twenty and thirty years ago. Earlier tests tended to be 
narrow in content, emphasized only rote memory and simple skills, 
and were standardized as the result of only meager research. Some of
today’s tests represent mammoth undertakings, and they go a long way 
toward measuring the truly important goals of education. In this chap- 
ter we will look at the typical content of comprehensive tests, see ex- 
amples of different kinds of tests employed at different grade levels, 
and discuss some general principles for the effective use of the in- 
struments. 


Content of Comprehensive Tests 


Comprehensive achievement tests differ from one another principally
in terms of (a) the broadness of their coverage and (b) the level of 
understanding which they measure. Regarding the former, compre- 
hensive tests tend to incorporate an increasingly wider range of con- 
tent in going from tests appropriate to the primary grades to those ap- 
propriate to the last two years of high school. The increasing breadth 
of subject matter is illustrated in Table 9-1, which shows the number 
and kinds of subtests employed in the Metropolitan Achievement 
Tests (23) at different grade levels. Whereas only four kinds of con- 
tent are included in the primary I test, thirteen kinds of content are 
included in the advanced test. The progressive broadening of content
in comprehensive tests is intended to keep pace with the actual scope
of classroom instruction. Admittedly, during the first several primary
grades, the content is largely restricted to reading and simple arithmetic
skills. In contrast, by the end of high school students have
studied so many different topics that a truly comprehensive test must
be very long (often requiring over six hours to administer) and must
contain many different types of content.

Table 9-1: Content of Metropolitan Achievement Tests

                                               Grade level
                               Primary I  Primary II  Elementary  Intermediate  Advanced
                               (first     (second
Subtests                       grade)     grade)      (3-4)       (5-6)         (7-9)

Word knowledge                    x          x           x            x            x
Word discrimination               x          x           x
Reading                           x          x           x            x            x
Arithmetic:
  Concepts and skills             x          x
  Problem solving                                        x            x            x
  Computation                                            x            x            x
Spelling                                                 x            x            x
Language:
  Usage                                                  x            x            x
  Punctuation and cap.                                   x            x            x
  Parts of speech and grammar                                         x            x
  Kinds of sentences                                                               x
Language study skills                                                 x            x
Social studies information                                            x            x
Social studies study skills                                           x            x
Science                                                               x            x

Regarding the level of understanding measured by different tests, in 
varying degrees today’s tests tend to stress understanding rather than
rote memorization only. Rote memory consists in such things as remembering
the date of the Declaration of Independence, remembering
that four times eight is thirty-two, and remembering how to correctly
spell the word “business.” What we usually mean by understanding
is the ability to see relations among facts. Thus, a student might memorize
Newton's laws of motion without really understanding them.
He would evidence understanding by using the laws to explain natural
events, e.g., why satellites remain in orbit. There are many levels of
understanding, ranging from a simple extrapolation of a principle to
obviously related events to the pulling together of diverse events into
one over-all explanation.


Of course, the major goal of education is to provide students with
understanding rather than to stress rote memory only. Even if rote
memory were the goal (as unfortunately it still is for some teachers),
it would be a futile aspiration. Unless the material which is memorized 
is used frequently, e.g., multiplication tables, it will greatly fade with 
time. However, at even the highest levels of education, some defini- 
tions and facts must be memorized before understanding is possible. 
This is particularly true in the primary grades, where so much simple 
material must be mastered. For this reason, comprehensive achieve- 
ment tests for the primary grades tend to contain a considerable 
amount of simple rules and facts relating to reading, language, and 
arithmetic. After the primary grades, broader understanding becomes 
increasingly important up through high school and beyond. 

At all levels, it is important for comprehensive achievement tests to 
employ a balance of material concerning different levels of knowledge. 
First, some comprehension of simple facts, definitions, and rules is im- 
portant at all levels. Second, understanding of restricted generaliza- 
tions is important in each subject matter, e.g., biology or civics. That 
is, if students have mastered particular topics, they should be able to 
successfully employ related principles, e.g., as mentioned previously,
to reason why satellites remain in orbit. Third, at the highest level of
understanding, students should learn to see relations among diverse
events and give critical reactions to arguments.

Of course, it is relatively easy to test for rote memory and for sim- 
ple levels of understanding, but it is a real challenge to effectively 
measure higher levels of understanding. Examples were given pre- 
viously of how various types of understanding can be measured, and 
some of the available comprehensive achievement tests do an admir- 
able job in that respect. As an example from one test concerning the 
understanding of literature, the student is provided with a segment 
from a play and asked to respond to objective items concerning infer- 
ences about events prior to the scene, intentions of the actors, and 
purposes of the author in the casting of characters. As another example, 
the understanding of meteorological principles is tested with the aid 
of a complex graph showing relations among rainfall, barometric pres- 
sure, and temperature in different localities over a period of months. 
Multiple-choice questions concern the kinds of generalizations that 
can be drawn from the data depicted. As a third example, to test the 
understanding of numerical concepts, the student is presented with 
the number 348 and asked, “What does the 3 stand for?” Other types 
of items for measuring various levels of understanding will be illus- 
trated in the following sections. 

Word Knowledge. Most of the comprehensive achievement tests 
have a subtest concerning the meaning of words. Of course, a knowl- 
edge of individual words is prerequisite to all learning, and it is one 
of the essential things to measure at each grade level. Word knowledge
is relatively easy to measure with objective test items, there
being many different types of items to use. Most widely used is the 
straightforward, multiple-choice definitional form: 


Ample means most nearly the same as 
a. scarce c. sufficient 
b. holy d. lovable 


Also frequently used is the “opposites” form: 


Ample means the opposite of 
a. meager c. kind 
b. hard d. full 


For the primary grades, items frequently employ pictures: 


a. horse c. rabbit 
b. cow d. cat


Some tests combine word knowledge with reading comprehension and
provide one over-all score for the two. It is probably wiser to score the
two separately. Some students who know the meanings of individual
words have difficulty in determining the meaning of whole sentences
and paragraphs. Also, some students manage to get the gist of connected
discourse but fail to understand the meanings of some of the
individual words.


Reading Comprehension. All the comprehensive tests employ
basically the same type of item for the measurement of reading. Students
are presented with connected passages, varying in length from
a dozen to five hundred or more words. The material is either an excerpt
from a story or a description of some event. After reading the
material, the student is asked questions about what was read. An example
from a first-grade test (Metropolitan Achievement Test) is as
follows:


I can fly.
I can sing.
I have a nest.
Who am I?
a. a girl
b. a bird
c. a dog

A more advanced paragraph (Metropolitan Achievement Test) is as 
follows: 



Frank has a good hobby. He collects stamps. He has stamps from many 
different places. Of course, he has many United States stamps. He saves 
them from letters he gets from his Aunt Carrie in Texas and his Cousin 
Jack in Ohio. But Frank also has stamps from foreign countries. 


Frank’s Aunt Carrie lives in 
a. Ohio c. New York 
b. Africa d. Texas 


In this story, the word saves means 
a. rescues c. keeps 
b. protects d. prevents 


The items on reading comprehension tests vary from those requiring 
the comprehension of simple facts to those requiring inferences and 
critical reactions. An example of a paragraph which tests more com- 
plex types of understanding is as follows: 


For several months Mr. Williams has heard a strange noise in his auto- 
mobile. One day he takes it to a repair shop. The repairman inspects the 
engine and says that a small part must be replaced. The repairman says 
that he is not sure when he will find time to make the repairs. Because 
Mr. Williams will not need the automobile for several days, he leaves it 
with the repairman. Two days later he returns and finds that the automo- 
bile has not yet been repaired. The repairman says, “I will fix it for you 
now.” Fifteen minutes later the work is completed. Then the repairman 
gives Mr. Williams a bill for twenty dollars. Five dollars is for the part 
replaced, and fifteen dollars is for installing the part. Mr. Williams is 
quite angry at the size of the bill, and he argues with the repairman that 
too much has been charged. 


Mr. Williams is angry because 

a. the repairman waited two days before making the repairs 

b. the repairman is rude 

c. the work is poorly done 

d. too much is charged for installing the part 

Mr. Williams can avoid such arguments in the future by 

a. asking in advance how much repairs will cost 

b. driving his automobile more carefully to prevent breakdowns 

c. taking his automobile to a repair shop the minute that he hears a 
strange noise 

d. trying different repairmen until he finds one that he likes 


Language Skills. Most of the comprehensive achievement tests
have sections dealing with language skills. The available tests vary
somewhat in coverage, but generally they include material on spelling,
punctuation, capitalization, case, and other elements of grammatical
usage. Spelling is tested in various ways, one of which is to
provide the teacher with a standard list to read to students. Students
write the words as the teacher pronounces them. Various objective
types of items also are used to measure achievement in spelling: 


Johnny ate a ham 
a. sandwich 

b. sandwitch 

c. sanwich 


We sing in the auditoriem. RIGHT WRONG 


One of the following words is misspelled. Mark it. 
sidewalk 

material 

— diffrent 

— immediate 


Numerous objective-type item forms also are used to measure other
aspects of language skills.


Mark the one that is correct. 
we shall leave after we eat. 
we shall leave. After we eat. 


The boy behaved 
a. badly 
b. bad 


John has ate his lunch. RIGHT WRONG 


Objective test items do a very good job of measuring most aspects of
language skills. For example, objective tests of punctuation and spelling
correlate highly with essay tests of the same skills. Where the available
comprehensive achievement tests fail to measure language skills
is in the actual composition of written material. Only one of the tests
(the STEP, see Appendix D-1) has a section in which students are
required to write on specified topics. This illustrates what has been
said several times previously: comprehensive achievement tests measure
most, but not all, of the important skills and knowledge. Teachers
must rely on their own exercises and tests to assess the ability to compose
written material.

Arithmetic. All the comprehensive achievement tests have sections
dealing with arithmetic. The tests differ from one another in their relative
emphasis on computation, arithmetic reasoning, and arithmetic
concepts. Examples of computational items are:


Add: 189 Subtract: 




In arithmetic reasoning items students are required to reason out 
the solutions to problems: 


Billy sells magazines in his neighborhood. He sells 14 each week, and 
he makes 5 cents on each. If he wants to earn $1.00 each week, how 
many more magazines must he sell? 

a. 6 b. 14 c. 24 d. 3


The scale of a map reads that 1 inch equals 60 miles. How many 
inches long would be a line on the map to show a distance of 40 miles? 
a. ½ in. b. ⅔ in. c. 1½ in. d. 1 in.
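
The two reasoning items above can be checked with a few lines of arithmetic. The following is only an illustrative sketch and is not taken from any published test; the variable names are hypothetical, and Python is used simply as a calculator. It assumes that Billy's goal of $1.00 is expressed as 100 cents and that the map scale is exactly 60 miles to the inch.

```python
from fractions import Fraction

# Item 1: Billy sells 14 magazines a week and makes 5 cents on each.
# How many more must he sell to earn $1.00 (100 cents) a week?
profit_per_magazine_cents = 5
weekly_goal_cents = 100
magazines_sold_now = 14

magazines_needed = weekly_goal_cents // profit_per_magazine_cents
additional_magazines = magazines_needed - magazines_sold_now

# Item 2: 1 inch on the map equals 60 miles.
# How long is a line that shows a distance of 40 miles?
miles_per_inch = 60
distance_miles = 40
line_length_inches = Fraction(distance_miles, miles_per_inch)

print(additional_magazines)   # 6
print(line_length_inches)     # 2/3
```

Both items call for reasoning about which operation to use, while the computations themselves stay very simple.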


Arithmetic concepts concern relations among arithmetic operations 
and relations between computational procedures and events in daily 
life. The following items illustrate the measurement of arithmetic 
concepts: 


Which one of the following months comes before June? 
a. April b. August c. July d. November 


Jack has saved 425 pennies. This equals: 
a. $425 b. 4.25 cents c. $42.50 d. $4.25 


Which one of the following lines is closest to one inch? 


Which number is closest to 400? 
a. 399 b. 395 c. 420 d. 402 


How many hours are there between 8 o'clock in the morning and 8 
o'clock at night? 
a. 8 b. 12 c. 10 d. 16


Numerator is a word used in 
a. addition b. subtraction c. multiplication d. division


Over the years the emphasis in achievement tests has shifted from 
computation to arithmetic reasoning and arithmetic concepts. Most 
teachers will agree that this is a healthy change. Computation is im- 
portant, but not as important as understanding the meaning of arith- 
metic operations and knowing how to use arithmetic operations to 
solve problems. Some computation is required in most items concern- 
ing arithmetic reasoning and arithmetic concepts. To prevent these 
items from measuring only computational skills, the computations are 
kept simple. You will notice in the items above used to illustrate arithmetic
reasoning and arithmetic concepts that some computations are
required, but these are very simple computations. More complex computations
should be reserved for those items specifically intended to
measure computational skills. 

One caution that must be observed in constructing arithmetic items 
is that only very simple words and sentences should be used to state 
problems. Otherwise, as often happens, the test concerns reading skills 
more than the understanding of arithmetic. 

Study Skills. It has often been said that one of the major purposes 
of formal schooling is to teach people how to learn on their own. In 
the classroom much time is spent in developing study skills. Students 
learn where to find information in reference sources, how to interpret 
charts and maps, and many other study skills. Because of the growing 
realization that study skills are important aspects of classroom learn- 
ing, most of the comprehensive achievement tests now include related 
sections. Either the items are interspersed in subtests dealing with 
content areas, such as social studies and science, or they are made into 
separate subtests. 


In a typical item relating to study skills, the test shows a graph
depicting the amount of rainfall in a half-dozen cities. Multiple-choice
questions are posed such as “Which city had the greatest average rain- 
fall?” “Which city had the lowest average rainfall?” “What was the 
average rainfall in Chicago?” In another item a road map is shown 
which covers parts of four counties. Multiple-choice questions are 
posed such as “The most mountainous soad ase Nee 
ton and Clancy are joined by 1 The largest town on 
the BB 25 

Typical items concerning where to find information and how to
interpret reference sources are:

Where would the part played by the United States Marines in the Second 
World War be located? 
a. information almanac 
b. history book 
c. dictionary 
d. atlas 


A dictionary would indicate the accented syllable in primary as: 


a. pri’macry 

b. pri mary 

c. PRI ma' ry 

d. primary 

Content Areas. Word knowledge, reading comprehension, language
skills, arithmetic, and study skills are the core courses in elementary
schools and, to a large extent, throughout all higher levels of education.
Of course, in addition to the core topics, students learn other things, 
including history, geography, biology, physics, and many others. These
are referred to as content areas. Should the content areas be repre- 
sented on comprehensive achievement tests? The question is difficult 
to answer at the present time, which is evidenced by the fact that 
about half of the available tests do include sections relating to content 
areas and half of them do not. 

Good arguments can be made both for and against the inclusion of 
such material. The most obvious argument for including content areas 
is that achievement tests are then more representative of the total 
school curriculum. There are two good negative arguments. The first 
is that content areas are secondary to the core topics, e.g., reading comprehension
and arithmetic reasoning, and that it would be an imprudent
use of space in achievement tests to include them. In rebuttal to
this, it can be said that whereas the core topics admittedly are the 
most important parts of education, the content areas also are impor- 
tant. The second negative argument is a practical one. Because of the 
diversity of material included in the content areas and because of the 
different coverages in different schools, it is difficult to construct truly 
representative measures of these topics. It is hard to entirely disclaim 
this argument. For example, in former years most comprehensive tests 
included sections on literature; but it was found that reading materials 
varied so much from locality to locality that it was almost impossible 
to assemble a representative collection of test items. A compromise is 
to include material relating to those content areas that are covered in 
most schools. For example, it is rather universal to teach American
history, elementary facts of human biology, the essentials of astronomy, 
and some simple physical principles. These can be made the basis of 
subtests for content areas. 

The inclusion of material relating to content areas in comprehensive 
achievement tests varies with the grade level. During the first four 
grades of elementary school, the overriding emphasis is on the core 
topics, and an insufficient amount of the content areas has been 
covered to justify their inclusion on achievement tests. Beginning with 
about the fifth grade, more attention is given to the content areas, and 
some of the achievement tests for that level, and higher levels, have 
subtests covering them. From the ninth to the twelfth grades, the cur- 
riculum is highly oriented toward content areas, so much so that it is 
difficult for comprehensive tests to provide a sufficiently broad cover- 
age. At these levels, comprehensive tests can be augmented by tests 
for special topics, e.g., natural science. Tests for special topics will be 
discussed in Chapter 10. 

Those tests that include content areas group the items into two 
broad categories: social studies and science. Social studies usually
comprises material from history, civics, and geography. Typical items
are: 


The Amazon river is in 
a. Africa b. South America c. Europe d. Asia 


The President of the United States 

a. controls the Armed Forces 

b. makes new laws 

c. controls the Supreme Court 

d. makes changes in the Constitution 


The Pilgrims lived mainly from 
a. manufacturing 

b. mining of gold and silver 
c. hunting and farming 

d. trading with Africa 


A governor is an officer of a 
a. city b. county c. state d. nation 


The state best suited for growing corn is 
a. Iowa b. Pennsylvania c. Arizona d. Oregon 


Content subtests for science typically contain a mixture of items
from physics, biology, botany, and astronomy. Typical items are:


People use a telescope to make things look
a. larger b. smaller c. brighter d. smoother


Plant leaves absorb the energy of the sun by 
a. photosynthesis b. osmosis c. symbiosis d. parthenogenesis


The lungs remove from the blood 
a. helium b. carbon dioxide c. nitrous ammonia d. oxygen 


Flowers multiply with the help of 
a. bees b. ants c. grasshoppers d. beetles 


Principles for Using Comprehensive Achievement Tests 


Most of the important principles for using achievement tests were
discussed in the previous chapter; however, a few special points need
to be considered. These are discussed in the following sections.

When to Administer. There are two important questions to be
considered here. The first is how frequently achievement tests should
be administered during the elementary and high school years. If a
school can afford the time and expense, it would be worthwhile to administer
comprehensive tests every year from the first grade through
high school. If it were possible to use achievement tests at only one 
point between the first grade and the senior year of high school, the 
best point probably would be at the end of the fourth grade. By that 
time the student has enough schooling behind him so that achievement 
tests give a good representation of school performance. Yet at that 
point most of the student's formal schooling is in front of him, and
there is ample time to make corrective changes. If achievement tests
could be administered at only three points in a student's schooling, the
best times would probably be at the second, fourth, and sixth grades. 
Some school systems employ comprehensive achievement tests only 
for the elementary grades. With some logic they argue that (a) the 
diversity of topics studied in high school is not given sufficient cover- 
age on the tests, and (b) there is little point in giving the tests at the 
high school level because by that time much of the good or bad has 
already been done. The author reiterates his stand that if the school 
can afford it, it is wise to give comprehensive achievement tests every 
year. 

The second question to answer is that of the time of year in which 
achievement tests should be administered. In answering this question, 
one thing is sure: a test should be administered at the time stipulated 
in the test manual for the particular test. If the manual says that the 
test is to be administered in the month of May, it should be admin- 
istered at that time. Otherwise, the test norms will not apply to the 
particular group of students. If students have several more months of 
schooling than those on whom the test was standardized, they will 
appear to perform better than their actual abilities warrant. 

Still left unanswered is the question of the best time of year for ad- 
ministering achievement tests. One school of thought says that achieve- 
ment tests should be administered at the beginning of the school year, 
and the other school of thought says that they should be administered 
at the end of the school year. There are good arguments for both points 
of view. The major argument for administering the tests at the be- 
ginning of the school year is that this supplies the teacher with up-to- 
date information about the strong and weak points of each student. 
The major argument for administering tests at the end of the school 
year is that this supplies information needed for decisions about pro- 
motion and for sectioning of students at the beginning of the fall term. 
Because of the increasing use of summer school for both good and 
poor students, achievement tests administered near the end of the 
school year provide valuable information for structuring activities in 
the summer. Also, scores made at the end of the previous year gener- 
ally are not very different from scores students would make at the be- 
ginning of the following year. The author feels that the weight of the 
argument is in favor of administering comprehensive achievement
tests near the end of the school year.


How to Administer Achievement Tests. When comprehensive 
achievement tests were new, it was the custom to have them admin- 
istered by an “expert,” e.g., a school psychologist. Now that achieve- 
ment tests are used so widely and so frequently, the practice has 
changed to having teachers administer achievement tests to their own 
students. Most of the widely used achievement tests have simplified 
their testing procedures to the point where any conscientious teacher 
can adequately administer them. Test manuals usually are quite explicit
about how to administer the instruments. Of course, teachers
should follow instructions to the letter and not deviate from the stand- 
ard procedures. To the extent to which teachers in any way give their 
students advantages not allowable in using the tests, they destroy the 
effectiveness of the instruments. 


Selection of Achievement Tests. Although comprehensive achievement
tests tend to differ from one another in ways described earlier in
the chapter, in one sense they are all alike: all the major, commercially 
distributed tests are carefully constructed and standardized. The one 
dimension on which they vary importantly is on the extent to which 
they measure general educational development rather than mastery of 
special topics. Some of the tests (particularly the STEP) employ many 
items that do not relate directly to any particular topics, but rather 
they concern the extent to which students can deal effectively with 
concepts. There are very good arguments for employing materials of
these kinds, because, to a large extent, the over-all goal of education
is to produce the thinking, reasoning man. On the other hand, there is
need to know how well students have mastered particular topics such 
as grammatical usage, history, and geography. It would be easy to ad- 
vise that both types of instruments be used, but schools are already 
fully burdened with the cost and time required to use achievement 
tests. 
Actually, concept-oriented and topic-oriented tests are probably not
so different as they appear on the surface. Although the author knows
of no study to support his claim, it is probably so that the two types of
measures would correlate .80 or higher. However, the small difference
is important. It may be that concept-oriented achievement tests are
somewhat better for predicting success in future schooling and for success
outside of school. Also, they may provide useful information about
how well the over-all goals of instruction are being met. However, for
most of the uses of achievement tests discussed in the previous chapter,
conventional, topic-oriented achievement tests probably are best.


Summary 


The purpose of comprehensive achievement tests is to measure the
over-all progress of students in school up to a particular point in time.
Many of the items on comprehensive achievement tests look much 
like those that the teacher employs for his own examinations. The two 
types of tests differ mainly in that (a) comprehensive achievement 
tests are more concerned with over-all progress, whereas teacher-made 
tests are more concerned with progress in particular areas of study 
over relatively short periods of time; (b) large-scale norms are avail- 
able on achievement tests that permit the comparison of individual 
students with students across the country; and (c) comprehensive 
achievement tests are constructed with much more care than is possible 
for most teacher-made tests. 

Corresponding to the increasing complexity of what is taught, the 
content of comprehensive achievement tests grows more complex in 
moving from the primary grades up through high school. At the primary
level, the major emphasis is on reading comprehension, word
knowledge, and simple arithmetic skills. At succeeding levels, a 
broader array of arithmetic content is included, and materials are in- 
corporated to measure language skills and study skills. Beginning at 
about the fifth-grade level, some of the tests incorporate material re- 
lating to content areas. Typically this material is incorporated in two 
sections respectively concerning science and social studies. 

Only about eight comprehensive achievement tests are widely em- 
ployed in American schools. The major difference among them is that 
some employ items relating to content areas and others restrict them- 
selves to the core topics and skills. As was described in the chapter, 
there are good arguments both for and against the inclusion of material
relating to content areas. Aside from this major difference, the
tests are distinguished from one another by minor differences in content
and types of items employed. All the available commercially distributed
instruments are excellent, and it is difficult to choose among
them. 


Chapter 10


Achievement Tests for Special Topics


Previously a distinction was made between two kinds of tests for spe- 
cial topics: survey tests and diagnostic tests. The purpose of the survey 
test is to tell how much the student knows about a particular topic;
the purpose of the diagnostic test is to provide some insights into the 
work habits and special faults of a student’s approach to a topic. In 
practice it is difficult to draw fine distinctions between the two or to 
neatly categorize some instruments as survey tests and others as diagnostic
tests. Any test that divides the content into a number of parts
provides some diagnostic information. Also, because of the appealing
sound of the word, many tests are labeled “diagnostic” that really supply
little diagnostic information. The word “diagnostic” will be reserved
for those tests which do give considerable amounts of information
about work habits, examples of which will be shown later in the
chapter.


Survey Tests for Special Topics 


Survey tests for special topics are not different in principle from the 
subtests on comprehensive measures of achievement. For example, a 
survey test for reading will contain much the same types of materials 
included in the reading subtest of a comprehensive measure. The survey
tests for special topics tend to differ in only a few ways from the
respective subtests on comprehensive measures. The major difference 
is that survey tests for special topics usually cover the area in more 
detail than would be possible in comprehensive measures. For example, 
an achievement test for reading skills only would contain more items,
and cover more aspects of reading, than would be possible in comprehensive
tests. A second difference is that many of the tests for special
topics, particularly those for the high school levels, measure
achievement in special areas that are either not covered at all, or only
lightly covered, in comprehensive measures. This is the case for tests 
in special areas like economics, trigonometry, chemistry, and physics. 

In a sense, some tests for special topics are competitive with com- 
prehensive tests. That is, rather than use a comprehensive achievement 
test at the fourth-grade level, separate tests could be employed for 
reading, arithmetic, language skills, and others. Unless there is some
good reason for employing tests for special topics, a hodgepodge of 
separate tests should not be substituted for a comprehensive measure. 
Much to the credit of the comprehensive batteries is the fact that all 
subtests are constructed according to the same general principles, and 
they are all standardized on the same students. Consequently, it is 
much safer to make comparisons among the scores on different sub- 
tests of the same comprehensive measure than it would be to make 
comparisons among different tests for special topics, e.g., tests for 
reading and arithmetic constructed by different test publishers. 

In some situations there are good reasons for employing achieve- 
ment tests for special topics. One is to obtain more detailed informa- 
tion about a child who does poorly on a particular subtest of a compre- 
hensive measure. For example, if a child does poorly on the reading 
subtest of one of the comprehensive achievement tests, some addi- 
tional information might be obtained by administering one of the spe- 
cial tests for reading. Another good reason for employing tests for spe- 
cial topics is to measure achievement in those special areas of study in 
high school that are not well covered on comprehensive measures. For 
example, if a school wants to know how well students are performing 
in chemistry, it would have to employ a special test for chemistry.
Such special tests also are useful for counseling students who plan 
higher education. For example, if a student is planning to study engi- 
neering in college, it would be very helpful to know how well the 
student performs on special tests for mathematics, physics, and chem- 
istry. Other combinations of special tests would be useful for advising 
students on their future courses of study. 

Many different kinds of special tests are commercially distributed. 
To give some idea of the range of tests available, one can find tests of 
achievement in Hebrew, trigonometry, agriculture, handwriting, in- 
dustrial arts, and driver education. (Sources for these and other 
achievement tests will be cited in Chapter 17.) To more concretely 
illustrate the nature of achievement tests for special topics, representa- 
tive test materials will be described in the following sections. 

Reading Skills. Numerous special tests are available for reading. In
principle these are very much like the reading subtests of comprehensive
measures. Typically, they require students to read paragraphs and
answer questions about them. Also, typically they contain sections on
word meaning and sentence meaning. Many of them measure speed of
reading, which is not measured in most comprehensive tests. The major
difference between special tests for reading and subtests for reading 
included in comprehensive measures is that the former tend to be 
longer and test more aspects of reading. A typical measure is the Iowa 
Silent Reading Tests (37), which includes an elementary battery for 
grades 4 to 8 and an advanced battery for higher grade levels. The 
advanced battery includes the following subtests: 


Test 1. Rate of reading and comprehension of prose
Test 2. Directed reading of prose to answer particular factual questions
Test 3. Poetry comprehension
Test 4. Vocabulary in different content areas
Test 5. Sentence meaning
Test 6. Paragraph comprehension
Test 7. Location of information using an index


The major reasons for employing special tests for reading skills are 
(a) if comprehensive achievement tests are not used at most age levels, 
to measure progress in that important topic and (b) even if compre- 
hensive tests are used frequently, to obtain additional information 
about students who perform poorly on the reading sections of compre- 
hensive measures. 

Mathematics. Separate tests for mathematics are used less fre- 
quently in elementary grades than separate tests for reading. The 
reason is that most comprehensive tests provide an adequate coverage 
of mathematics at the elementary levels, and there is less need for 
special tests. 

Special tests for mathematics come into their own at the high school
level. Although at that level comprehensive tests have items relating
to mathematics, they tend not to go into detail on any one topic, e.g.,
plane geometry. Consequently, to adequately test the mastery of par-
ticular topics in mathematics, it usually is necessary to employ one of
the special tests. Special tests are available in algebra, plane geometry,
solid geometry, trigonometry, and other mathematics topics. The fol-
lowing items used to measure achievement in trigonometry will illus-
trate the nature of the tests.


The sine of an angle is .8. The cosine is
a. .8 b. .36 c. .2 d. .6


A ground line is extended 60 feet from the base of a tree. If a string were
stretched from that point to the top of the tree, it would form an angle
of 45 degrees with the base line. How tall is the tree?

a. 45 feet b. 60 feet c. 90 feet d. 30 feet




Following is a diagram of a farmer’s triangular plot of land. 


[Diagram: triangular plot of land with vertices A, B, and C]


The farmer lives at point A. The distance from A to C is 500 feet. The 
angle formed by line segments B-A and B-C is 25 degrees. How far would 
he go if the farmer walked from point A to point B? 

a. 1,183 feet b. 1,072 feet c. 1,000 feet d. 1,244 feet 
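
The three trigonometry items above can be checked with a few lines of arithmetic. The following is only a sketch under stated assumptions: the angle in the first item is taken to be acute, the ground line in the second item is taken to be 60 feet, and the farmer's plot is assumed to have its right angle at C (the diagram does not survive in this copy, so that is an inference from the answer choices). The variable names are illustrative.

```python
import math

# Item 1: sin(theta) = 0.8. For an acute angle, cos(theta) = sqrt(1 - sin^2).
sine = 0.8
cosine = math.sqrt(1 - sine ** 2)

# Item 2: the tree stands at one end of an (assumed) 60-foot ground line and
# the string makes a 45-degree angle with it, so height = 60 * tan(45 deg).
tree_height = 60 * math.tan(math.radians(45))

# Item 3: with an assumed right angle at C, sin(B) = AC / AB,
# so AB = AC / sin(B) with AC = 500 feet and B = 25 degrees.
distance_ab = 500 / math.sin(math.radians(25))

print(round(cosine, 2))    # 0.6
print(round(tree_height))  # 60
print(round(distance_ab))  # 1183
```

Under these assumptions the keyed answers would be .6, 60 feet, and 1,183 feet.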


Natural Science and Social Studies. Although numerous special 
tests are available for individual topics in natural science and social 
studies, e.g., physics and economics, few schools can afford the time 
and expense to purchase and use the many tests needed to fully cover 
the range. A good compromise is to use two omnibus tests that broadly 
cover the natural sciences and the social studies, respectively. Typical 
of such tests are the General Achievement Tests for Natural Science 
and Social Studies (19) of the Cooperative Test Division of the Edu- 
cational Testing Service. Each of the tests is divided into two parts, 
consisting, respectively, of (a) terms and concepts and (b) compre- 
hension and interpretation. Part (a) is measured by multiple-choice 
items. In part (b) students are presented with paragraphs, maps, and 
graphs and required to give explanations and interpretations. Illustra-
tive items from part (a) of the test for natural science are:


The rate at which an automobile increases in speed after starting is called 
its 

a. velocity d. kinetic energy 

b. acceleration e. inertia 

c. momentum 


Silvering the inner surfaces of a vacuum (thermos) bottle decreases heat 
loss by 

a. air leakage d. convection 

b. conduction e. radiation 

c. vaporization 


Vegetarians should eat relatively large quantities of milk, peas, beans,
and wheat because these foods all contain

a. vitamins d. carbohydrates

b. proteins e. minerals

c. fats




Sample items from part (a) of the test for social science are: 


An excise tax is a tax on 

a. inherited wealth 

b. real property 

c. exported commodities 

d. earned income 

e. commodities manufactured for domestic distribution 


Which of the following crimes is a misdemeanor? 


a. murder d. kidnapping 
b. arson e. exceeding the speed limit 


c. burglary 


Which of the following cities suffered a loss in trade during the fifteenth
and sixteenth centuries because of the development of new routes to the
Far East?


a. Venice d. Paris 
b. Lisbon e. Madrid 
c. London 


Vocational Courses. One important use of achievement tests is for 
vocational topics such as home economics, industrial arts, and secre- 
tarial training. Of course, such topics are not covered at all in com- 
prehensive achievement tests. Typical of the tests for vocational topics 
are those for secretarial training. Tests are available for typing, in 
which the student is required to type standard passages. Scores are 
obtained for both speed and accuracy. Tests for shorthand are avail- 
able which either require students to translate standard passages into 
shorthand or to mark errors in a shorthand translation. Other tests are
available with respect to writing business letters, bookkeeping, filing,
and office management. Information sources for these and other tests
are discussed in Chapter 17.


Special tests for vocational studies are particularly useful because
they provide objective standards as to whether or not students have
mastered their skills. Such information is necessary for counseling stu-
dents about vocational objectives and for recommending them for par-
ticular jobs. Also, for want of better standards, the achievement tests
are often used directly for the assignment of course grades.


Diagnostic Achievement Tests 

Of course, teachers spend much of their time trying to diagnose the
work habits and particular difficulties of their students. Mrs. Brown
notes that one student makes many reversal errors in reading, tending
to substitute “was” for “saw,” “no” for “on,” and others. Another stu-
dent typically accents the first syllables of words even when that is
not correct, e.g., de’-light, ad’-mit, and se’-cure. A third student
typically confuses certain letters of the alphabet, e.g., m and n, b and


In arithmetic, the teacher notes that one student makes errors in 
calculating time because he operates as though there were twenty 
rather than twenty-four hours in the day. When asked how long it is 
from eight o'clock in the morning to eight o'clock at night, he responds
with “ten hours.” In multiplication another student typically misaligns
successive rows. A third student repeatedly makes the same type of
error in counting off decimal places in division. 

Being able to diagnose characteristic errors provides the teacher 
with valuable information in tailoring instruction to the needs of each 
student. Diagnostic achievement tests represent extensions of what 
teachers try to do every day in diagnosing the particular difficulties of 
students. The tests provide exercises and problems that maximize the 
possibilities for making errors and exhibiting poor work habits, and 
they provide techniques for observing and scoring what the student 
does. 

As was mentioned previously, diagnostic tests are limited to reading 
and arithmetic. Representative diagnostic tests in these areas will be 
described in the following sections. 

Reading. One of the most widely used diagnostic reading tests is 
the Gray's Oral Reading Paragraphs (36). The test consists of twelve 
short paragraphs, graded in difficulty from those appropriate to the 
first grade to those appropriate to the eighth grade. The nature of the
test can best be illustrated by quoting sections from the directions for
scoring:


Each pupil should be tested individually in a quiet place, free from 
distraction, and where other pupils to be tested will not hear the reading. 

Hand the pupil a copy of the standardized paragraphs and give the 
following directions: “I should like for you to read some of these para- 
graphs for me. Begin with the first paragraph when I say ‘Begin.’ Stop 
at the end of each paragraph until I say ‘Next.’ If you should find some 
hard words, read them as best you can without help and continue read- 
ing.” Pupils above the fourth grade should begin with paragraph 4, but 
are to be given full credit for the first three paragraphs, the same as if 
they had read them without any errors. However, if two or more errors 
are made in paragraph 4, ask the pupil to read the preceding paragraphs 
also. In case pupils in the first two grades hesitate several seconds on a 
difficult word, pronounce it for the pupil and mark it as mispronounced. 

While the pupil is reading, record: (a) the time required to read each 
paragraph, and (b) the errors made. 

(a) Note the exact second at which the pupil begins and completes 
the reading of a paragraph. Record the number of seconds required in 
the margin to the right of the paragraph. 

Material quoted by permission of the Public School Publishing Company. 




(b) The following paragraph illustrates the character of the errors and 
the method of recording them. 


The sun pierced into my large windows. It was the opening of Octo-
ber, and the sky was of a dazzling blue. I looked out of my window and
down the street. The white houses of the long straight street were
almost painful to the eyes. The clear atmosphere allowed full play to
the sun's brightness.

[In the printed manual this paragraph carries the examiner's handwritten
error marks, which are described in the next paragraph.]


If a word is wholly mispronounced, underline it as in the case of “At- 
mosphere.” If a portion of a word is mispronounced, mark appropriately 
as indicated above: “pierced” pronounced in two syllables, sounding long 
a in “dazzling,” omitting the s in “houses” or the al from “almost,” or the 
r in “straight.” Omitted words are marked as in the case of “of” and 
“and”; substitutions as in the case of “many” for “my”; insertions as in 
the case of “clear”; and repetitions as in the case of “to the sun’s.” Two 
or more words should be repeated to count as a repetition. 

It is very difficult to record the exact nature of each error. Do this as
nearly as you can. In all cases where you are unable to define clearly the
specific character of the error, underline the word or portion of the word
mispronounced. Be sure you put down a mark for each error. In case you
are not sure that an error was made, give the pupil the benefit of the 
doubt. If the pupil has a slight foreign accent, distinguish carefully be- 
tween this difficulty and real errors. Each pupil should be allowed to 
continue reading until he makes 7 errors in each of 2 paragraphs. 
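
The stopping rule at the end of the quoted directions (continue until the pupil makes 7 errors in each of 2 paragraphs) can be sketched as a small procedure. This is an illustration only: the directions do not say whether the two paragraphs must be consecutive, so the sketch assumes any two paragraphs count, and the function name and the error counts are hypothetical.

```python
def stopping_paragraph(errors_per_paragraph, limit=7, needed=2):
    """Return the 1-based number of the paragraph after which reading stops,
    or None if the pupil never reaches the limit in enough paragraphs.

    Assumption: any `needed` paragraphs with at least `limit` errors
    trigger the stop; they need not be consecutive.
    """
    paragraphs_at_limit = 0
    for number, errors in enumerate(errors_per_paragraph, start=1):
        if errors >= limit:
            paragraphs_at_limit += 1
            if paragraphs_at_limit == needed:
                return number
    return None


# Hypothetical record of errors on paragraphs 1, 2, 3, ...
record = [0, 1, 3, 7, 5, 8, 9]
print(stopping_paragraph(record))  # 6
```

In this made-up record the pupil reaches 7 errors on paragraph 4 and again on paragraph 6, so the reading would stop after paragraph 6.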


In addition to indicating particular kinds of errors that students make 
in reading, the test provides a system for converting total numbers of 
errors into grade-equivalent scores. However, as will be discussed 
more fully later in the chapter, such norms should be regarded with 
considerable suspicion.

Arithmetic. Typical of the diagnostic tests for arithmetic is Funda- 
mental Processes in Arithmetic (16). The student is presented with a 
series of graded problems in addition, subtraction, multiplication, and 
division. He is asked to “talk out” the solution to each problem. Items 
are scored both for correctness of solution and for the types of work 
habits exhibited. The work habits shown by each student are indicated 
on a chart, which is shown in Figure 10-1. Selected passages from the
test manual will illustrate the purpose and nature of the test:


A standardized test in arithmetic will indicate whether a pupil is doing 
satisfactory or unsatisfactory work for a given school grade. It enables the
teacher to identify those pupils who need special attention. However, 
The quoted material is presented with the permission of the Public School
Publishing Company. For the sake of brevity, the quoted material leaves out some
passages and sections from the test manual.




TEACHER'S DIAGNOSTIC CHART FOR INDIVIDUAL DIFFICULTIES
FUNDAMENTAL PROCESSES IN ARITHMETIC
Prepared by G. T. Buswell and Lenore John
Bloomington, Illinois

ADDITION: (Place a check before each habit observed in the pupil's work)

a1  Errors in combinations               a15 Disregarded column position
a2  Counting                             a16 Omitted one or more digits
a3  Added carried number last            a17 Errors in reading numbers
a4  Forgot to add carried number         a18 Dropped back one or more tens
a5  Repeated work after partly done      a19 Derived unknown combination from familiar one
a6  Added carried number irregularly     a20 Disregarded one column
a7  Wrote number to be carried           a21 Error in writing answer
a8  Irregular procedure in column        a22 Skipped one or more decades
a9  Carried wrong number                 a23 Carrying when there was nothing to carry
a10 Grouped two or more numbers          a24 Used scratch paper
a11 Splits numbers into parts            a25 Added in pairs, giving last sum as answer
a12 Used wrong fundamental operation     a26 Added same digit in two columns
a13 Lost place in column                 a27 Wrote carried number in answer
a14 Depended on visualization            a28 Added same number twice

Habits not listed above

(Write observation notes on pupil's work in space opposite examples)

[Worked addition examples with the teacher's handwritten observation notes]


Figure 10-1. Sample page from the Diagnostic Chart used with Fundamental 
Processes in Arithmetic. 




the marked limitation of such a test is that it does not tell why the pupil 
fails nor how he has made errors. Before the teacher can give effective 
help to a failing pupil, she must know exactly what the pupil does to 
cause his failure. She must understand the methods that he uses. These 
methods of work are so varied and complex that efficient teaching re- 
quires a systematic and organized scheme of diagnosing them. To illus- 
trate some of the varied ways in which pupils work, the following ex- 
amples are given. 

In adding a column of figures, the most common method is to proceed 
regularly either up or down the column. However, many pupils do not do 
this, but instead, they skip around in an apparently random fashion. One 
boy explained his method of adding by saying that he did not like to add 
and therefore he always added the most difficult numbers first in order to 
be through with them. Consequently, he added all of the nines, then all 
of the eights, then all of the sevens, and so on down to zero. Needless to 
say, in this amount of skipping around in the column, he overlooked 
some of the numbers entirely, and, consequently, got a wrong answer. 
Other children tried to make as many easy combinations as possible, for 
example, a 4 and 6, regardless of whether the 4 and 6 appeared together 
or at opposite extremes of the column. While grouping numbers appears
to have some advantages for skilled accountants, observation of children's
work indicates that such attempts at grouping more frequently result in
failure than in success.

Another type of procedure in adding is illustrated by the work of a
girl in the sixth grade. She had never learned her combinations suf-
ficiently well, consequently she constantly resorted to counting in order to
get the proper answer. For example, in adding 7 and 9 she worked as
follows: “Well, 7 plus 4 is 11, 7 plus 5 is 12, 7 plus 6 is 13, 7 plus 8 is
14, 7 plus 9 is 15.” She made an error, due to skipping the combination
7 plus 7, but she failed to notice it. Throughout her work she continuously
added in this fashion. Needless to say, she can never do effective work in
addition until these extravagant and time-consuming methods are elimi-
nated. The teacher thought that she was merely a slow adder and had no
idea of the method she was following, although she was in the sixth grade
and passed the following year into junior high school, where the teaching
of addition would probably never be mentioned again.

Still a different type of case was a child who did approximately one-
half of the examples wrong in a test of addition. In analyzing her work
it was found that in every case the error was due to one cause, namely,
lack of knowledge of how to carry. Obviously, the proper treatment of
this case is specific teaching of how to carry rather than simply the ap-
plication of more drill in addition.


Directions for Diagnosis 

Individual Work. It is recommended that the Diagnostic Chart be
used with all pupils who are doing unsatisfactory work in arithmetic. The
most economic method is to make a list of the pupils whose work is to be
analyzed and then to proceed systematically with the diagnosis, giving
the other children in the group practice exercises or seat work until the 
diagnoses are finished. The diagnosis should be made individually and 
should cover only one of the four fundamental operations at a given 
time. For example, after practice exercises or seat work has been as- 
signed to the class, the teacher should select the child to be diagnosed 
and sit down with him at her desk or at a table in the corner of the room. 
She should make the child feel as much at home as possible, since the 
success of the diagnosis depends upon how intimately the teacher be-
comes acquainted with the details of the pupil’s method of work. Since
the causes of failure in arithmetic are generally due to poor methods of 
work, successful teaching depends, first of all, upon finding out just what 
methods are used. 

Procedure. After the teacher and the pupil are seated at the table 
where the work is to be done, the pupil should be provided with a Work 
Sheet and the teacher with the Diagnostic Chart. The blank spaces at the 
top of the Chart, giving the pupil’s name, age, grade, etc., should be 
carefully filled in before proceeding with the diagnosis. The teacher 
should then direct the child to proceed with the examples in the operation 
to be observed, as for example, addition. She should tell the child to 
work the examples in the way that he ordinarily does and to write the 
answers in his usual manner. He should be told that the teacher wishes 
to know just how he gets his answers and, for this reason, that he is to 
do as much of his work as he can aloud. Tell him “to do all of his think- 
ing aloud.” A careful explanation by the teacher, together with an illus- 
tration by her, is ordinarily sufficient to indicate to the child exactly what 
is wanted, and after the first example or two, the child usually proceeds 
in a very natural fashion. 

Since the success of this type of diagnosis depends upon discovering 
how the child works in his normal manner, the teacher should not make 
any attempt in the diagnostic process to suggest ways of working or to 
correct the pupil’s bad habits of working. This should be done later. In 
the diagnosis the aim is to find out just how the pupil works when he is 
working independently. The child should be made to feel as natural as 
possible, and a cordial relationship between the teacher and the child 
during this period is necessary. 

As the child works, the teacher should check on the Diagnostic Chart 
the types of habits which occur, at the same time recording the child’s 
procedure in the space opposite the examples on the teacher’s chart. 
The most satisfactory way to do this is to make a record of the habits ob- 
served in the exact words of the pupils, at least for the first few times. If 
the habit appears later with other examples, it is sufficient to refer back 
to the earlier procedure. The results of the diagnosis should be that the 
teacher has a clearer knowledge of the specific habits which are respon-
sible for the pupil’s poor work.

Distinction between Diagnosis and Testing. One particular distinc-
tion between the method of diagnosis and the method of testing should
be pointed out. After a test is given the final score is computed, which
indicates the grade of work which the pupil is doing. Ordinarily, atten-
tion centers simply in the score which is used for purposes of classification. 
In the method of diagnosis there is no final score. The procedure is not 
used for purposes of classification, but rather for purposes of teaching. 
Consequently, the desired result is a very clear understanding on the 
part of the teacher of just how the pupil does his work in order that more 
effective teaching may follow. Since this is the case, the teacher should 
not be satisfied simply with making the diagnosis and checking the items, 
but rather, she should study carefully the characteristics of the pupil’s 
work and formulate in her own mind the most appropriate methods of 


teaching for such a pupil. 


The manual lists numerous work habits and provides examples of 
each. Some illustrative examples are: 


Forgot to add carried number.

Example:    268    The error here is due to neglecting to carry. The sub-
            961    ject said, “8 and 1 is 9; 6 and 6 are 12; 9 and 2 is 11.”
          1,129

In subtraction did not allow for having borrowed.

Example:    528    “4 from 8 equals 4; 6 from 12 equals 6; bring down
             64    the 5.”
            564

In multiplication made error in position of partial products.

Example:     97    A pupil placed the second product directly under the
             12    first in this example.
            194
             97
            291
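
The three faulty work habits illustrated above are systematic enough to be imitated step by step, which makes it easy to see how each one produces the pupil's wrong answer. The sketch below is an illustration, not part of the test manual; the function names are hypothetical, and each function assumes non-negative whole numbers (and, for subtraction, that the top number is the larger).

```python
def add_forgetting_carry(a: int, b: int) -> int:
    """Add column by column but never add the carried digit."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    width = max(len(xs), len(ys))
    xs, ys = xs.ljust(width, "0"), ys.ljust(width, "0")
    written = ""
    for i in range(width):
        column = int(xs[i]) + int(ys[i])
        # The leftmost column is written in full; elsewhere only the
        # units digit is written and the carry is forgotten.
        digits = str(column) if i == width - 1 else str(column % 10)
        written = digits + written
    return int(written)


def subtract_ignoring_borrow(a: int, b: int) -> int:
    """Borrow ten when needed but never reduce the next digit of the top number."""
    xs = str(a)[::-1]
    ys = str(b)[::-1].ljust(len(xs), "0")
    written = ""
    for top, bottom in zip(xs, ys):
        top, bottom = int(top), int(bottom)
        if top < bottom:
            top += 10  # borrows, then forgets having borrowed
        written = str(top - bottom) + written
    return int(written)


def multiply_misaligned(a: int, b: int) -> int:
    """Sum the partial products without shifting them by place value."""
    return sum(a * int(digit) for digit in str(b))


print(add_forgetting_carry(268, 961), 268 + 961)     # 1129 1229
print(subtract_ignoring_borrow(528, 64), 528 - 64)   # 564 464
print(multiply_misaligned(97, 12), 97 * 12)          # 291 1164
```

Each line prints the pupil's answer beside the correct one, so the size of the error produced by a single faulty habit can be seen directly.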


Use of Diagnostic Tests. Some points about the use of diagnostic 
tests should be obvious from the examples shown above. First, such 
tests ordinarily are given only to children who are having trouble with 
either reading or arithmetic. They are too time consuming to routinely 
apply to all students. Second, diagnostic tests require more skill of the 
teacher than is the case with survey achievement tests. In order to 
skillfully present the materials to students and to accurately mark 
errors as the student goes along, a considerable amount of practice is 
required. Also, the success of the diagnostic testing session depends to
a considerable extent on the teacher's personality and his ability to
handle children. If the student is afraid of the teacher, or if the teacher
is not sufficiently skillful to obtain cooperation from shy or hostile stu-
dents, the diagnostic testing session will be of little avail.

An inherent difficulty of diagnostic tests is that they try to measure 
so many things at once. To adequately measure all the components of
reading, for example, would require a very long and involved instru-
ment. Few diagnostic instruments have the amount and breadth of 
material required to adequately measure all the constituent skills in- 
volved, an exception being the Fundamental Processes in Arithmetic, 
described previously. 

One important caution in using diagnostic tests is that none of the 
available instruments has been standardized on a representative cross 
section of students in this country. Consequently, those instruments 
that deal with scores in the form of grade equivalents and other types 
of norms are quite misleading. As was said in the material quoted 
from the manual for the Fundamental Processes in Arithmetic, num- 
bers of correct and incorrect answers are not what is important on 
diagnostic instruments, but, rather, how the student approaches the 
task. Although most of the available diagnostic instruments are far 
from perfect, they do provide useful supplements to teachers’ informal 
diagnoses of the work habits of students. 


Summary 


In this chapter were discussed two kinds of tests for special topics: 
survey tests and diagnostic tests. Survey tests for special topics are 
much like the subtests on comprehensive measures, except that they 
tend to be longer and incorporate a wider variety of materials. Al- 
though it is not wise to use a collection of tests for special topics in 
preference to a comprehensive test, tests for special topics have two 
purposes. The first is to learn more about students who perform poorly 
on a particular subtest in one of the comprehensive measures, e.g.,
reading. The second is to measure progress in topics taught in high 
school that are given little or no coverage in comprehensive measures, 
e.g., trigonometry and shorthand.

Diagnostic tests are available only for reading and arithmetic. Their 
purpose is to provide the teacher with insights into why particular stu-
dents are having difficulty with those two topics. Diagnostic tests (ones
that truly fit the name) are rather different from comprehensive tests
and survey tests for particular topics. They must be administered in-
dividually, and informative results depend considerably on the teacher’s
skill. Rather than consider them as “tests,” in the more formal meaning
of the term, it is best to consider diagnostic instruments as exercises 
which help teachers recognize the work habits and particular faults 
that students have in reading and arithmetic. 


Part IV


Prediction and Trait 
Measurement: Human 
Abilities 


Parts II and III of the book were concerned, respec- 
tively, with teacher-made tests and achievement tests, 
both of which are outstanding examples of assess- 
ments. In Chapter 2 it was said that, in addition to the 
assessment function, two other important functions are 
served by tests: prediction and trait measurement. Al- 
though teachers are not as involved with these as they 
are with assessment instruments, they should know 
enough about them to interpret the results of commer- 
cially distributed instruments used as predictors and 
research results concerning trait measures. Both Parts 
IV and V will be concerned with prediction and trait 
measurement, Part IV with human abilities, and Part V 
with interests, attitudes, and personality character- 
istics. 

How do you distinguish between ability and non- 
ability (personality, attitudes, interests) measures? In 
some places the line is quite blurred and the two kinds 
of measures blend, but for most kinds of tests with 
which we deal, some important distinctions can be 
made. Ability tests concern how well an individual can 
perform. Typically students are asked to solve twenty 
arithmetic problems in a short time, memorize a list of 
words, or comprehend the meanings of paragraphs. 
The student is keyed to do the best that he can, and 
what constitutes success and what constitutes failure 


are clearly understood. 
How would the tests mentioned above differ from 


tests containing items like, “Do you usually lead the 


discussion in group situations?” “What does this ink
blot look like to you?” “Which would you prefer to
have as a neighbor: an Englishman or a Frenchman?” 
“What would you rather do: make a speech or keep 
records for a club?” The first two items are typical of 
those that have been used in personality tests, the third 
is typical of items appearing in measures of attitudes, 
and the fourth is typical of items appearing on interest 
tests. 

In this latter type of item there are no obvious right
and wrong answers. Tests cannot be scored in such a
way as to report to the student how many correct an-
swers he gave. Items of this kind (nonability items)
are intended to determine how the individual typically
behaves: his typical behavior with respect to other
people (which is part of what we call personality), his
typical preferences for other people (attitudes), and
his typical choice of activities (interests). It is best
said then that the nonability tests concern typical be-
havior.

To make a distinction between tests of ability and 
tests of typical behavior is more than mere hairsplitting.
Different kinds of problems arise when con-
structing, using, interpreting, and validating the two
kinds of instruments. For example, on ability tests, the
test administrator must guard against cheating. If a
student peeks in the text or at another student's paper,
he might obtain the correct answer. But because there 
are no correct answers in measures of typical be- 
havior, cheating, as such, is not an important problem. 
Instead, the student is likely to “fake,” to give answers 
which sound good even if they do not typify his be- 
havior. 

Another difference between the two types of meas- 
ures is that it has proved much easier to find valid 
predictors among ability tests rather than among 
measures of typical behavior. For example, ability 
tests have been used successfully for many years to 
select college students, but valid personality tests for
the selection of college students are still not available.

It has proved much easier to obtain “construct valid-
ity” for ability-type instruments than for measures of
typical behavior, one index of which is the correlation
between different proposed measures of the same trait.
For example, different tests of intelligence tend to cor-
relate highly, so highly that it is hard to choose among
them; in contrast, different measures of some of the 
personality characteristics, e.g., “rigidity,” correlate so 
low with one another as to raise serious questions 
about their validity. The distinctions above between 
measures of ability and measures of typical behavior 
will be further explained in Parts V and VI. 
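
The index mentioned above, the correlation between two proposed measures of the same characteristic, is ordinarily computed as a product-moment (Pearson) coefficient. The following sketch shows the computation; the function name is illustrative and the scores are invented solely to show the arithmetic, not taken from any real test.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(variation_x * variation_y)


# Hypothetical scores of six students on two different intelligence tests.
test_a = [95, 102, 110, 118, 125, 131]
test_b = [98, 100, 113, 115, 128, 129]

print(round(pearson_r(test_a, test_b), 2))
```

A coefficient near 1.0 means the two instruments order students in nearly the same way; a coefficient near zero means they do not, which is the situation described above for some personality measures.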

The measures to be discussed in Part IV mainly are 
important in helping to make decisions about the 
“readiness” of students for particular types of educa- 
tional activities. Typical decisions concerning readi- 
ness are those involved in (a) admitting an underage 
child to kindergarten, (b) placing a slow learner in a 
special remedial class, (c) permitting a high school 
student to enter a special accelerated program of 
study, and (d) advising a student about entrance into 
college. To the extent to which tests of human ability 
actually aid in making such decisions about readiness, 


they perform a very valuable service. 


11


Factors of Intellect 


Ted Bronson was the smartest student that Miss Brown had ever had
in the fourth grade. Now he has moved on to the fifth, and Miss Brown
wonders how he is doing in Mrs. Martin’s class. At lunch Mrs. Martin
says that Ted is indeed bright but not quite as bright as Billy Bern-
stein. In conversations such as this, we frequently fail to ask an im-
portant question: Are they bright in the same ways? We too frequently
assume that there is only one factor (dimension or yardstick) of in-
telligence; and after students have been ordered along it, there is
nothing to say about their capabilities.

In contrast to the way in which we teachers are prone to make a
simple ordering of students with respect to over-all ability, athletic
coaches are more discriminating when talking about members of their
track teams. If you ask the coach at Woodlawn High if Russell Husek
is “a good athlete,” a typical response would be, “Well, he is very good 
at short-distance running, hundred-yard dash, and other sprints, but 
he does not do so well on the longer runs; also he is a pretty good pole 
vaulter.” Describing another member of the team, the coach might say 
that he is excellent in all the jumps—broad jump, high jump, and pole 
vault, but lacks the wind to make a good runner. 

In previous chapters the word “intelligence” has been used with 
Some misgivings because it suggests that there is only one general 
ability, rather than different types of ability. If intelligence is perfectly 
general, the person who can do one type of intellectual problem well 
can do all other kinds of problems well, and the person who does 
poorly on one type of problem will do poorly on all others. If abilities 
are perfectly general, the child who easily learns to solve arithmetic 
problems will have the same facility in learning spelling, geography, 
and other topics. Another possibility is that human abilities are
completely “specific,” that there are no correlations among different
intellectual tasks. Then, if we know that a particular child is very adept
at learning arithmetic, that offers no basis at all for predicting how
well he will do in spelling or geography. Taking the case further, if 
abilities are completely specific and unrelated to one another, the child 
who can add numbers very quickly might be slower than the average 
child in doing multiplication. 

Human abilities are neither completely general nor completely spe- 
cific. The real story lies between these two extremes. In this chapter 
we will discuss the history of the problem, related research, and some 
of the different abilities which underlie human intellect. 

Faculty Psychology. In the early nineteenth century the belief was 
prevalent that human behavior was determined by a large number of 
separate capacities, or faculties, as they were called. There were pro- 
posed faculties of attention, memory, reasoning, will power, esthetic 
appreciation, and many others. It was the early belief that each faculty 
resided in a particular brain location, and that “bumps on the head”
were prognostic of the strength of particular faculties. Although the
anatomical theories were soon discarded, the belief in a large number 
of specific human faculties lingered throughout the remainder of the 
nineteenth century. Faculty psychology represented the extreme of 
the point of view that human abilities are highly specific. 

Binet and General Intelligence. In France at the turn of the cen- 
tury, Alfred Binet (10) studied the measurement of intelligence. At 
first he worked in line with the faculty school and sought to measure 
intelligence through many simple physical and behavioral indices. 
Some of his “tests” concerned suggestibility, size of the cranium, tactile
discrimination, graphology, and even palmistry. Like the others
who were trying to measure human abilities, he found that these sim- 
ple functions do not measure intelligence, as we commonly think of it. 

Binet abandoned these efforts and, instead, adopted a “molar” conception
of intelligence. For practical purposes, he thought that it
would not be possible to measure all the simple skills that underlie in- 
telligent behavior. Instead, it would be more feasible to study the end 
products of intellectual functioning. In other words, rather than go 
back and try to find out why some people behave more intelligently 
than others, Binet sought to measure the extent to which individuals 
could deal intelligently with their present environments. He defined
intelligence as “the tendency to take and maintain a definite direction;
the capacity to make adaptations for the purpose of obtaining a desired
end; and the power of auto-criticism.”

Working with Simon, Binet developed the first practical test of general
ability (or intelligence, as it is usually called). It consisted of
thirty items which concerned, variously, following simple directions,
defining words, constructing sentences, and making judgments about
the correct behavior in real life situations. The Binet-Simon scale has
gone through a number of revisions since that time. The scale set the 
tone for most measures of intelligence to follow, and even our modern 
measures of general ability bear many resemblances to those early
efforts.

An important point to note about the Binet-Simon scale, and all 
measures of general ability to follow, is that the child is provided with
only one score, which is intended to represent his mental age. (More 
will be said about the scoring of tests of general ability in the next 
chapter.) Because each child receives only one score, it is implicitly
assumed that intelligence is general, or, in other words, that only one 
factor is involved in the test items. If there are, say, two kinds of in- 
telligence involved in the test, the use of only one score is somewhat 
misleading. Two children could make the same score, not because 
they are alike with respect to their abilities, but because one child is 
high in one of the factors and low in the other, and vice versa for the 
other child. The same assumption of the generality of human abilities 
is involved in all the tests of general ability (intelligence tests) which
have followed from Binet’s early work. Binet’s test and all those to fol- 
low represent the extreme opposite point of view from the faculty 
school, that intelligence is perfectly general and that one over-all index 
is all that is required to indicate a child’s intellectual standing. 


Factor Analysis 
Rather than argue over whether intelligence is general or is divided
up into many specific abilities, experimental and statistical procedures
are available which will help answer the question. Many different
kinds of tests can be given to students, and correlation coefficients can
be obtained for all possible pairs of them. As was mentioned in Chapter 4,
the correlation coefficient indicates the extent to which two tests
measure the same thing. If many different kinds of mental tests were
given to a large number of students and the correlations among them
were very high, it would be evidence that intelligence is completely
general. In other words, all tests tend to measure the same thing. On 
the other hand, if the correlations among the tests were all zero, it 
would show that each test measures something different from the 
others, and this would be evidence that intelligence is completely spe- 
cific to the tasks involved. 
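
The procedure just described, correlating every pair of tests, can be sketched in a few lines of Python. This is only an illustration: the three tests and the five students' scores are invented, and the helper names `pearson_r` and `all_pair_correlations` are hypothetical, not taken from the text.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores of five students on three tests (invented for illustration).
scores = {
    "vocabulary": [12, 15, 9, 20, 17],
    "reading": [30, 34, 25, 41, 38],
    "addition": [6, 7, 6, 8, 5],
}


def all_pair_correlations(scores):
    """Correlate every possible pair of tests, as in a table of intercorrelations."""
    return {
        (a, b): pearson_r(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }


correlations = all_pair_correlations(scores)
for (a, b), r in correlations.items():
    print(f"{a} vs {b}: r = {r:+.2f}")
```

With these invented scores the two verbal tests correlate very highly with each other and only moderately with addition, which is the kind of pattern the following tables are meant to bring out.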

A look at some correlation coefficients will show how the question 
of the generality of intellectual functions is solved. Table 11-1 shows 
the intercorrelations of six tests. The tests are labeled A through F. 
The table shows all possible intercorrelations among the tests. For ex- 
ample, the table shows that tests A and B have a low correlation, .19 
to be exact. There is a high correlation between tests A and C, .64.
Test D correlates only .21 with test A. The highest correlation of test
D with any other test is .67, with test F. In the table the blank spaces
in parentheses are at points showing the correlations of tests with
themselves. Of course, any test correlates perfectly with itself, at least
regarding the particular set of scores obtained on any one occasion. 
What conclusion would you come to by looking at Table 11-1? Do the
tests all measure the same thing, or do they each measure different
traits? Some of the correlations are quite high, for example, there is a
correlation of .72 between test A and test E; other correlations are
close to zero.


Table 11-1: Correlations among Six Tests


       A      B      C      D      E      F
A     ( )    .19    .64    .21    .72    .05
B     .19    ( )    .22    .64    .12    .55
C     .64    .22    ( )    .14    .56    .28
D     .21    .64    .14    ( )    .28    .67
E     .72    .12    .56    .28    ( )    .11
F     .05    .55    .28    .67    .11    ( )

A rearrangement of the order in which tests appear in the table will
help clarify the question regarding the amount of overlap between
tests. In Table 11-2, the order of appearance of the tests has been
rearranged to show those tests which tend to correlate highly with one
another. When presented in this way, it can be seen that tests A, E,
and C tend to correlate highly with one another, as do tests D, B, and
F. Correlations between the two groups of tests are very low. Lines
are drawn in the table to demonstrate the tendency of the tests to
divide up into two groups.


Table 11-2: Rearrangement of Correlations from Table 11-1


       A      E      C   |   D      B      F
A     ( )    .72    .64  |  .21    .19    .05
E     .72    ( )    .56  |  .28    .12    .11
C     .64    .56    ( )  |  .14    .22    .28
      -------------------+-------------------
D     .21    .28    .14  |  ( )    .64    .67
B     .19    .12    .22  |  .64    ( )    .55
F     .05    .11    .28  |  .67    .55    ( )


What we have here are two different types of tests, each of which is
said to form a factor. Tests A, E, and C relate to one factor, and any
student who does very well on one of these tests is likely to do well
on the others. Tests D, B, and F form another factor, and any student
who does well on one of them is likely to do well on the others.
However, if a student
does well on the tests representative of one cluster or factor, it
provides little information as to how well he will do on the other. 
In this example, tests A, E, and C all concern verbal comprehension: 
vocabulary, grades in English, and scores on a reading comprehension 
test. Tests D, B, and F, respectively, are scores in addition, multiplica- 
tion, and the solution of algebraic problems. The second factor con- 
cerns the ability to perform arithmetic operations. 
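
The rearrangement by inspection can be imitated mechanically. The sketch below, which is an illustration and not a procedure given in the text, links any two tests whose correlation reaches a cutoff and collects linked tests into clusters. The cutoff of .50 and the function name `find_clusters` are assumptions made here, and the B-D value is read as .64 from Table 11-2.

```python
# Correlations as read from Table 11-2 (each pair listed once).
# Assumption: the B-D entry is taken as .64.
r = {
    ("A", "E"): .72, ("A", "C"): .64, ("A", "D"): .21, ("A", "B"): .19,
    ("A", "F"): .05, ("E", "C"): .56, ("E", "D"): .28, ("E", "B"): .12,
    ("E", "F"): .11, ("C", "D"): .14, ("C", "B"): .22, ("C", "F"): .28,
    ("D", "B"): .64, ("D", "F"): .67, ("B", "F"): .55,
}
tests = ["A", "B", "C", "D", "E", "F"]  # original order of Table 11-1


def corr(a, b):
    """Look up a correlation regardless of the order of the two labels."""
    return r.get((a, b), r.get((b, a)))


def find_clusters(tests, threshold=0.50):
    """Group tests so that any two tests with r >= threshold share a cluster."""
    clusters = []
    for t in tests:
        # Existing clusters containing at least one test highly correlated with t.
        linked = [c for c in clusters if any(corr(t, m) >= threshold for m in c)]
        merged = [t]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return [sorted(c) for c in clusters]


clusters = find_clusters(tests)
print(clusters)  # [['A', 'C', 'E'], ['B', 'D', 'F']]
```

A fixed cutoff works here only because the example is so clean; with less tidy correlations the choice of cutoff would change the groupings, which is one reason more refined procedures are needed.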

Studying correlation tables, such as those shown previously, allows 
us to determine how many clusters, or factors, are involved in intellec- 
tual tasks. The example presented above is a very simple one: only 
two factors are involved, and those are quite easily seen by inspecting 
the correlation table. If the correlation results were always so straight- 
forward and studies contained no more than six tests, it would not be 
necessary to apply more refined procedures to the results. However, 
results are seldom as neat as those presented above; and instead of 
studying only six tests at a time, we often study as many as fifty or
more tests in one table. In those instances it is no longer feasible
simply to look at the correlation table and rearrange the ordering of
the tests in such a way as to show the dominant clusters. Instead, it is
necessary to apply some mathematical procedures which identify the
major factors and indicate the extent to which each test belongs to
each factor. The mathematical procedures which are so applied are
referred to as factor analysis. 
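
The text does not spell out the mathematics, so the following is only a rough sketch of the general idea, using a simpler relative of factor analysis: principal components extracted from the correlation matrix by power iteration. It is not the procedure used in the studies described. The matrix is Table 11-2 with 1.00 placed on the diagonal (an assumption), and the function names `dominant_component` and `deflate` are hypothetical.

```python
from math import sqrt

labels = ["A", "E", "C", "D", "B", "F"]
# Table 11-2 with 1.00 on the diagonal (assumed); B-D read as .64.
R = [
    [1.00, .72, .64, .21, .19, .05],
    [.72, 1.00, .56, .28, .12, .11],
    [.64, .56, 1.00, .14, .22, .28],
    [.21, .28, .14, 1.00, .64, .67],
    [.19, .12, .22, .64, 1.00, .55],
    [.05, .11, .28, .67, .55, 1.00],
]


def mat_vec(m, v):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def normalize(v):
    """Scale a vector to unit length."""
    length = sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def dominant_component(m, iterations=500):
    """Power iteration: eigenvalue and unit eigenvector of largest magnitude."""
    v = normalize([float(i + 1) for i in range(len(m))])
    for _ in range(iterations):
        v = normalize(mat_vec(m, v))
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


def deflate(m, eigenvalue, v):
    """Subtract one component so the next power iteration finds the following one."""
    n = len(m)
    return [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(n)] for i in range(n)]


value1, vector1 = dominant_component(R)
value2, vector2 = dominant_component(deflate(R, value1, vector1))

for name, a, b in zip(labels, vector1, vector2):
    print(f"{name}: first = {a:+.2f}, second = {b:+.2f}")
```

In this sketch the first component weights all six tests in the same direction, and the second gives tests A, E, and C one sign and tests D, B, and F the other, which matches the two clusters found by inspection.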

During the last half century, many different factor-analytic studies 
have been performed, not only on tests of human ability, but on 
measures of personality, interests, attitudes, and many other kinds of
individual differences. Each factor-analytic study represents a very
large undertaking. First, it is necessary to compose all the tests that
will be used, and as was said previously, these may number more than
fifty. Then the entire group of tests must be given to a large number
of persons, preferably more than three hundred. After the tests are
scored, all possible correlations must be computed among them, resulting
in tables of the kind shown above. Then the mathematical procedures
of factor analysis must be applied. The end result is a number
of factors showing the dominant clusters among the tests. The factors
that have been found in each area of individual differences are said
to constitute a “structure,” or, as one might say, a map describing the
common tendencies among tests.
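
To give a sense of the size of the undertaking, the number of distinct correlations among n tests is n(n - 1)/2. The short sketch below simply evaluates that count; the function name `number_of_correlations` is a hypothetical label used here.

```python
from math import comb


def number_of_correlations(n_tests):
    """Count the distinct pairs of tests, that is, n(n - 1)/2."""
    return comb(n_tests, 2)


for n in (6, 50):
    print(n, "tests ->", number_of_correlations(n), "correlations")
```

Six tests require 15 correlations, as in Table 11-1, while fifty tests require 1,225.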


Factors of Ability 

The many factor analyses performed during the last half century
have provided the answer to the question of the relative generality
or specificity of human abilities. The results show that neither extreme
point of view is correct and that a middle ground must be adopted.
Arguing for the generality of abilities is the now well-established fact 
that correlations among tests of ability are almost always positive, 
even if small in some cases. For example, tests as disparate in appear- 
ance as vocabulary, memory of digits, and mechanical information 
all correlate positively, if no higher than about .20 or .30. It would 
be rare indeed to find one type of human ability that correlated nega- 
tively with another. If that were found, it would indicate that, statis- 
tically speaking, people who performed well on one task would tend 
to perform poorly on another. Human abilities all tend to go together, 
even though in some cases the statistical relationships are very weak. 
These findings offer partial confirmation of the generalist point of 
view, as represented by Binet and those who followed in his footsteps. 

Factor-analytic results have also provided confirmation for the multi- 
factor point of view. In addition to the tendency of all tests of human 
ability to correlate positively with one another, there are definite 
clusterings of the tests as shown by the correlations which have been 
obtained. For example, all tests involving the ability to understand
words, such as reading comprehension, vocabulary, and grades in
English, tend to have high correlations with one another, averaging
.60 or more. Similarly, all tests involving numerical computations, such
as addition, subtraction, multiplication, and finding square roots, tend 
to correlate highly with one another and thus form another cluster or 
factor. Correlations between the two kinds of tests are positive, 
typically averaging .30 or higher, but correlations between the mem- 
bers of the two clusters are not nearly so high as the correlations 
within the clusters. This indicates that the members of each cluster 
tend to hang together and measure something that is partially separate 
from what is measured by the other cluster or factor. 
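
The contrast between correlations within clusters and correlations between clusters can be checked on the illustrative values of Table 11-2. These are the textbook example's numbers, not the research averages quoted above, and the B-D value is assumed to be .64. The helper name `mean_r` is hypothetical.

```python
from itertools import combinations, product

# Correlations as read from Table 11-2 (each pair listed once).
r = {
    ("A", "E"): .72, ("A", "C"): .64, ("A", "D"): .21, ("A", "B"): .19,
    ("A", "F"): .05, ("E", "C"): .56, ("E", "D"): .28, ("E", "B"): .12,
    ("E", "F"): .11, ("C", "D"): .14, ("C", "B"): .22, ("C", "F"): .28,
    ("D", "B"): .64, ("D", "F"): .67, ("B", "F"): .55,
}


def corr(a, b):
    """Look up a correlation regardless of the order of the two labels."""
    return r.get((a, b), r.get((b, a)))


def mean_r(pairs):
    """Average correlation over a collection of test pairs."""
    values = [corr(a, b) for a, b in pairs]
    return sum(values) / len(values)


verbal = ["A", "E", "C"]   # verbal comprehension tests
number = ["D", "B", "F"]   # arithmetic tests

within_verbal = mean_r(combinations(verbal, 2))
within_number = mean_r(combinations(number, 2))
between = mean_r(product(verbal, number))

print(f"within verbal cluster: {within_verbal:.2f}")
print(f"within number cluster: {within_number:.2f}")
print(f"between clusters:      {between:.2f}")
```

In the example table the average correlation is about .64 within the verbal cluster, about .62 within the number cluster, and about .18 between the two clusters.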

Factor-analysis studies have shown that there are dozens, and per- 
haps even hundreds, of such clusters or factors. However, most of 
these factors are concerned with highly specialized activities. For 
example, one such highly specialized factor concerns the effect of 
certain types of visual illusions on perceptual judgment; another con- 
cerns the ability to memorize digits presented in serial order. A great 
deal more research needs to be done before it will be possible to say 
which of these factors are important and in which situations. Some 
factors have proved useful in predicting success in school, in industry, 
and in military settings, but the bulk of them have been used insuffi- 
ciently to determine their importance. A great deal of research has 
been done on the factors that make for success in school at all levels, 
and it has been found that only a few of the factors are very important.
The following sections will describe some of the most
important factors, including those which are important for schoolwork 
and for vocational activities. 




Verbal Factors 


The most important factors relating to schoolwork concern the 
abilities to understand, use, and deal with written and spoken lan- 
guage. As is true of all the types of factors which will be discussed, 
there are many possible verbal factors that can be found by exhaus- 
tively analyzing many different types of verbal tests. However, only 
two of these seem to be very important for schoolwork. They are
verbal comprehension and verbal fluency, which will be discussed
in turn.

Verbal Comprehension. The most important verbal factor concerns
the ability to understand written and spoken language. Verbal
comprehension represents most of what we refer to as “reading skill.”
Although the factor extends far beyond sheer vocabulary, a vocabulary
test provides a good measure of verbal comprehension.


Typical items: 


1. Which one of the following words means most nearly the same as
salutation?
a. offering
b. greeting
c. discussion
d. appeasement


2. Which one of the following words is most nearly the opposite of
languid?
a. unemotional
b. sad
c. energetic
d. healthy


Verbal Fluency. Verbal fluency concerns the ability to rapidly
produce words and sentences. It can be thought of as the rate of
production aspect of verbal ability in contrast to verbal comprehension,
which concerns the depth of understanding of verbal material.


Typical items: 
1. Write as many names of foods as you can in the next two minutes.


2. In each of the following rows write three words that mean almost the
same as the given word.

small    ______  ______  ______
helpful  ______  ______  ______
kind     ______  ______  ______




Verbal comprehension comes into play when rather complex words, 
sentences, and paragraphs are being dealt with. Verbal fluency comes 
into play when the verbal material is relatively simple and when 
fluidity of expression is at issue. The two types of abilities are somewhat
correlated. Correlations of about .40 or .50 are typically found
between the tests used to measure verbal comprehension and verbal 
fluency. On the other hand, the two types of abilities are far from 
perfectly correlated. A child can understand what he reads very well 
but have great difficulty in explaining it because of his lack of verbal 
fluency. On the other hand, the talkative child who produces a torrent 
of words in ordinary conversation does not always have the depth of 
comprehension to match. Teachers are sometimes fooled by the child
who is quite facile in expression but who does rather poorly when
he is required to read and understand material or to analyze poetry
or essays.


Number Skills 


Numerical Computation. One very clear factor of number skills 
has been found in many different studies. It concerns the speed of 
solving arithmetic problems of all kinds—addition, subtraction, multi- 
plication, division, finding square roots, and others. 


Typical items: 


  246       8,754
+ 943       - 381       16 × 22 =       2844 =


It should be clearly understood that not all problems containing 
numbers measure numerical computation. Numbers also appear in 
many of the reasoning tests, which will be discussed in the next 
section, but such problems do not concern numerical computation. 
The numerical computation factor comes into play when some complex 
(for the age group) numerical solutions must be obtained. If the 
numbers involved are very simple and they are included only as a 
way of providing a useful method of expressing the solution to a 
problem, then numerical computation, as such, may be very unim- 
portant. This distinction will be made clearer when reasoning factors 
are discussed. 

Tests of numerical computation usually show smallish (about .40)
correlations with measures of verbal ability. As was mentioned previ- 
ously, all measures of human ability tend to correlate positively with 
one another, but in this instance, numerical computation has a far 
from perfect relationship with other factors of ability. Numerical
computation is not the same as mathematical reasoning which is involved
in simple algebraic problems. Although algebraic problems require
some numerical computations, these are usually relatively simple. 
Instead, algebraic problems relate more prominently to the reasoning 
factors, which will be discussed below. Because a child is very good 
in number skills in the elementary grades, this does not mean that the 
child is likely to be good in mathematical topics, such as algebra, 
geometry, and trigonometry, later. 


Reasoning Factors 


Although many different studies have been performed of reasoning 
tests, the factors involved are still somewhat unclear. Reasoning is 
a complex domain in which the abilities involved tend to blend in 
different ways in different tests, making it hard to separate the reason- 
ing factors from one another and to find good measures of any of 
them. The most clearly determined factors are as follows: 

General Reasoning. The most common and most commonly found 
factor of reasoning is concerned with the ability to invent solutions 
to problems. Arithmetical reasoning problems are most characteristic
of the factor.


Typical items:


1. If two men can build one house in twelve weeks, how many houses
can twelve men build in two weeks?

2. How would you get exactly 7 quarts of water from a stream if you
had one 5-quart container and one 3-quart container?


As was mentioned previously, even though such simple algebraic
problems involve numbers, the main ability is not that of numerical
computation. In order to solve the problems, the student must invent
a solution, grasp some principle by which each can be solved.

Deduction. The deduction factor is concerned with the drawing
of conclusions, as in logical syllogisms. In this type of reasoning there
is nothing in particular to be discovered or invented, the ability being
concerned with evaluating the implications of an argument.


Typical items: 


1. John is younger than Fred.
Bill is older than Fred; therefore, Bill is ______ than John.


2. A student has 10 marbles. No one else in his class has 10 marbles.
This means that
a. no one else in the class has marbles
b. all the other students have less than 10 marbles
c. some of the students have less than 10 marbles 
d. some of the students have more than 10 marbles 
e. only one student has exactly 10 marbles 


Seeing Relationships. A third factor of reasoning has appeared
often enough to be cited here but is not as firmly established as general
reasoning and deduction. The factor involves the ability to see
the relationship between two things or ideas and to use the relationship
to find other things or ideas. This factor is best represented by
verbal analogies and design analogies.


Typical items: 


Ship is to sail, as automobile is to
a. ship
b. seat
c. motor
d. wind
e. driver


[Design analogy item: figures not reproduced.]


Memory Factors 


As is true of the other areas of human ability that we have dis- 
cussed, there is more than one type of memory. Some of the better 
established factors of memory are as follows: 

Rote Memory. The best-established factor of memory concerns the
ability to remember simple associations where meaning is of little or
no importance.


Typical item:


The student is given a list of names, each of which is paired with a
number. He is given a minute or so to memorize which number goes with
which name. Then he is told to turn the page. The next page contains
the list of names without the numbers. The student is instructed to write
the proper numbers next to the names.


Other items which concern rote memory are of the same general
kind as the one shown above. They would be such as pairing colors
with words, initials with last names, and photographs with geometrical
forms.

Meaningful Memory. There is substantial evidence to indicate that 
there is a factor involved in the retention of meaningful relationships 
which is separable from rote memory. The factor of meaningful mem- 
ory appears when the student is requested to memorize sentences, 
meaningfully related words, and lines of poetry. 


Typical items: 
1. The student is asked to read and try to remember a list of sentences
like the following:
John repaired the wagon by welding the broken axle.
The list of sentences is taken away and the student is then given the
same sentences with one or more of the words deleted from each, like
the following:
John repaired the wagon by welding the broken ______.

2. The student is shown a list of meaningfully related pairs of words
such as the following:
dog—bark 
shoe—leather 
hard—candy 
small—box 


The list is taken away and the student is presented with only one
member of each pair as follows:

dog    ______
shoe   ______
hard   ______
small  ______


There is evidence for several other memory factors. A number of
investigators have reported a memory span factor concerning the
ability to recall perfectly for immediate reproduction a series of
unrelated items. A typical item would be to read a series of five to
a dozen digits and ask the student to give the digits back in their
exact order. There is some evidence for a visual memory factor in
which the ability to grasp the relationships within a picture or pattern
is important. A typical item would consist of showing an individual a
landscape picture and asking him to remember the details. Then the
picture would be taken away, and the student would be presented
questions like “How many sheep were in the picture?” “What was the
boy handing to the man?” “Where was the swing located?” The visual
memory factor might be related to the ability to remember faces,
license numbers, and witnessed events.




Spatial Factors 


Spatial Orientation. This factor concerns the ability to detect 
accurately the spatial arrangement of objects with respect to one’s
own body. The factor would be necessary in deciphering pictures
taken from a maneuvering airplane. If the plane is simultaneously
turning and climbing, the landscape looks very different from the
normal view. The individual who can accurately detect what maneuver
the airplane is going through from looking at only a picture of the
landscape from that vantage point has good spatial orientation. The
factor appears most prominently when the spatial problems are
presented under “speeded” conditions.


Typical items: 


Every alternative that could be obtained by a
rotation of the first figure is to be marked.


Spatial Visualization. This second spatial factor differs in a subtle
manner from spatial orientation. It is present when the student is
required to imagine how an object would look if its spatial
position were changed. Although there is good statistical support for
the factor, there has been considerable difficulty in understanding
the underlying processes. Spatial orientation seems to require
either an actual or imagined adjustment of one’s own body. In spatial
visualization the student cannot solve the problem by a bodily adjustment;
instead, the student must conceive of how an object would look
if its spatial position were markedly changed. In contrast to spatial
orientation, spatial visualization is best tested under relatively
“unspeeded” conditions.


Typical item: 


The subject is shown a folded piece of paper with a number of holes
punched in it. He is asked to choose from a number of alternatives how
the paper would look if unfolded. 


* Reproduced by permission of Science Research Associates.




Perceptual Factors 


A number of factors have been found which concern the ability to 
detect visual patterns and to see relationships within and among 
patterns. Some of these factors are apparently of only limited im- 
portance, such as the ability to judge certain types of illusions. 
Several of the more important factors will be described. 

Perceptual Speed. This factor is concerned with the rapid recog- 
nition of perceptual details and particularly with the similarities and 
differences between visual patterns. 


Typical items: 
1. The student is shown a geometrical form and asked to choose from a
number of other forms the one that is the same as the first.

2. The student is told to make a check mark by each pair of letter groupings
if they are identical and to make no mark if they are different.

[Sample pairs of letter groupings not reproduced.]

(Fifty to several hundred pairs would be used, depending on the time
allowed.)


Perceptual Closure. This factor concerns the perception of objects
from limited cues. The word “closure” means a sudden awareness of
an obscure object or relationship. Perceptual speed requires only the
recognition of a perceptual form. Perceptual closure requires the
“putting together” of a perceptual form when only part of it is
presented.


Typical items:*

[Figures not reproduced.]

* Adapted from Thurstone (79).


There is some evidence for a flexibility of closure factor in problems
which require the subject to detect one perceptual pattern which is
imbedded in a distracting or competing pattern. This factor is found
in such items as the hidden-picture games that are printed in news- 
papers and some children’s magazines. At first glance, the picture 
looks like a normal landscape, but after careful scrutiny a number of 
faces can be found hidden in the trees and rocks. In order to see the 
faces, it is necessary to resist or “break down” the perception of the 
object in which the faces are imbedded. 


Typical items: 


[Figures not reproduced.]

In each of the two items, the figure on the left is embedded
in one or more of the four alternative figures on the right.


Speed Factors


The question of whether a test should be “speeded” or not has come 
up a number of times in previous chapters. Nearly all tests have a
time limit for practical purposes, if for no other reason. In a vocabulary
test, for example, there is no intention of hurrying the student,
and the time limit is usually liberal enough to satisfy most students. 
Some abilities are intimately connected with speed, such as typing 
and shorthand speed; and also some of the factors which have been 
discussed, such as word fluency and perceptual speed, can only be 
measured by the use of “speeded” tests.

Considerable interest has been shown over the last fifty years in 
determining whether or not speed and accuracy are the same in 
psychological tests. Although the findings on this topic have not been
consistent and the answer is not the same with all tasks, the general
conclusion is that speed goes along with accuracy to an extent, but
that these two attributes are far from perfectly correlated. A com- 
parison of scores given under “speeded” and “nonspeeded” conditions 
shows that at least one and perhaps a number of speed factors can 
be isolated. It is not known at this time just how speed factors inter- 
act with the other ability factors. 


* Reproduced by permission of Science Research Associates, 




Factors and the Child 


After looking at the factors presented in previous sections and con- 
sidering that these are only some of the many factors that have been 
found, the reader is likely to have the discomforting feeling that the 
school child is fragmented into many different pieces which are hard 
to put together in a living image. Fortunately, the problem is not 
quite that complex. Some of the factors which have been cited, and 
most of the many which have not been cited, are of relatively little 
importance for the teacher in the classroom. Although a great deal 
more research needs to be done before we can say with confidence 
how factors of ability grow, change, and interact with success in daily 
life, there are some guideposts that can be used. 

Verbal comprehension is by far the most important factor to con- 
sider in relation to schoolwork. Schoolwork mainly consists in a world 
of words—printed and spoken. The child reads his assignments, learns 
the meaning of new words, gives verbal replies to questions by the 
teacher, talks to his friends in relation to group projects, and writes
paragraph descriptions of recent events. The school setting is a highly
verbal world, and unless the child has an understanding of words and
how words go together in sentences, paragraphs, and books, he is
crippled. It may be that, depending on his vocation in later life, he
will not be so dependent on verbal comprehension, but in the school
setting verbal comprehension is all important. The first four or five
grades of elementary school are almost synonymous with the development
of verbal comprehension. Because it is of such central importance,
a good test of vocabulary or reading comprehension is the best
indicator of scholastic aptitude as shown in the primary grades. If a
child is very high in verbal comprehension and only average in some
of the other important factors (which is a rare occurrence), he still
probably will do well in schoolwork. His lack of ability in other areas,
such as in the areas of reasoning factors, may hinder him in higher
education and in vocational pursuits, but because of the centrality of
verbal comprehension for schoolwork, he will probably manage to
make good grades, at least at the elementary school level.

Verbal fluency provides the student with a special advantage. He
can talk fluently, and he makes a good impression on others. As was
mentioned previously, although, statistically speaking, verbal comprehension
and verbal fluency tend to go together, there are some real
exceptions. Consequently, the teacher should be on guard not to
mistake sheer fluency for the deeper understanding of words and
written material. Verbal fluency allows an individual to “sell himself,”
and because verbal fluency is not always matched by a high level on
other important factors of ability, such an individual often oversells
himself to the subsequent disappointment of teachers in school
situations and employers and coworkers later in life.

The numerical computation factor is mainly important for [...]
topics in the early elementary grades. Children must learn to add,
subtract, divide, obtain square roots, and develop other [...].
Nothing is more frustrating to a teacher than to have a child
who consistently flounders in number skills. Although there is [...]
between numerical computation and other important [...]
and the reasoning factors, such [...].
Consequently, a child can have [...]
be superior with respect to [...].
When the deficiency in numerical skills
is very extreme, such as the eighth-grade child who still cannot add
and subtract, it probably indicates an over-all deficiency in
intellectual abilities. Other than for these extremes, facility in numerical
computation is a nice thing to possess, but not intimately related to
eventual achievement in life. Even some of the better mathematicians
(whose abilities depend more on the reasoning factors) are very poor
at numerical computations.

The reasoning factors are, of course, all important for high-level
vocational and professional activities later in life. They also play some
part in success in school situations, more so in high school and college
than in the elementary grades. The reasoning factors are directly
involved in learning mathematics (not number skills) in the form of
simple algebraic problems in the higher levels of elementary school
and in mathematical topics presented in high school and in college. In
school situations, the reasoning factors mainly come into play when
the student is required to do something on his own, e.g., to write
an essay on world government. In such instances, the child who can
invent solutions to problems, see logical consequences, see parallels
between historical trends and what is going on now, has a distinct
advantage.

Memory factors come into play mainly when there are many simple
facts, dates, and names to be mastered. [...] Memory abilities are not
intimately related to the higher abilities, verbal comprehension and
the reasoning factors. A child can be quite intelligent in other ways
and have great difficulty in memorizing simple facts and details.
Memory apparently is strongly influenced by the desire to [...] and
the patience to work at it. As is true of nearly all the ability factors,
when the deficiency is quite extreme, e.g., the twelve-year-old child
who is unable to memorize the multiplication tables, it suggests an
over-all deficiency. Otherwise, memory factors are not highly
prognostic of eventual high-level success in college and beyond. Memory
plays a very important part in the first several grades at the primary
level. Much of the child’s time is consumed with memorizing the
alphabet, sounds relating to alphabetical letters, the sequence of
numbers, names, and simple facts. Although verbal comprehension also
plays an important part in the first several grades, the important part
played by memory factors often obscures the true ability of some
children and causes the teacher to overestimate the ability of others.
Some children who perform poorly when details must be memorized
manifest a superior ability in later schoolwork.

Nearly all children (excepting some true mental deficients) can be
taught to memorize the essential details in schoolwork. First, it would
be better if teachers discouraged memorizing for its own sake, e.g.,
memorizing long lists of names, places, and dates. This does not
constitute true intellectual accomplishment, and the child will soon forget
most of what he has memorized. Second, in those places where
memorization is essential for subsequent understanding, e.g., memorizing
the multiplication tables, the ability of all students to memorize
material can be greatly increased by the gifted teacher who turns
memorizing into pleasant games rather than dull activities.

The spatial and perceptual factors play only a modest part in
successful performance in elementary and secondary school. Spatial
factors are somewhat important for secondary mathematics, for special
mathematics courses in high school (e.g., geometry and trigonometry),
and, to some extent, for physics. Perceptual factors largely are
auxiliary skills that help students to only a small extent, e.g., in
proofreading an English theme for errors in spelling. Spatial and perceptual
factors come into play when students are out of school in professional
and vocational pursuits. Airplane pilots must have spatial orientation.
Visualization is essential to the draftsman; perceptual speed is
important for filing clerks. Tests of spatial and perceptual abilities are
quite useful in batteries employed in high school for vocational
counseling.

The need to test for many different factors of ability, rather than for
general ability only, varies directly with the school level. In the
primary grades only several of the factors we have discussed are
important, mainly verbal comprehension and general reasoning ability.
Later in elementary school some of the other factors play an important
part: deduction, seeing relations, word fluency, and others. In high
school some specialized topics, such as typing and mechanical
drawing, bring into play some of the more specialized perceptual and
spatial factors which were of relatively little importance earlier. As the
student goes into college and eventually out into vocations and
professions, an even wider array of factors must be taken into account.

The schoolday world of the child is comparatively simple. He must
possess verbal comprehension, some reasoning ability, and at least a
modicum of skill in verbal fluency and the memory factors. If he has
these, he will succeed. In contrast, vocational and professional
pursuits are quite complex and in many instances may depend on esoteric
factors which we seldom consider and which are presently very difficult
to measure. Because of the changing intellectual requirements, it can
be said that abilities are actually more general in children than adults.
The requirements on them are more general in that they relate to only
several of the factors which we have discussed. Intellectual
functions are more general in children. Apparently the many
specialized factors which are found in factor-analytic studies, e.g.,
spatial visualization, are due to the special interests and experiences
which occur as the individual matures. Because children in the
elementary grades have not been exposed to those varied activities,
differences in related skills do not form important factors and do not
need to be taken into account. The greater generality of intellectual
functions in children, and the generality of requirements at the
elementary levels, make it feasible to test for general ability (as will be
described in the next chapter) without losing a great deal of
information. Because only a few of the factors are very important in the
elementary grades, and because these tend to correlate with one another,
it makes some sense to add them all together in one test of general
ability.

What has been said and shown about the factors of intellect has
two important implications for teachers. First, as human abilities grow
more complex in the high school years, it is important to give
measures of some of the most significant factors rather than to test for
general ability only. Such measures will prove useful in advising students
about future schooling and vocations. Second, the factor-analytic
results provide a lesson that [...] quite uneven in his abilities,
e.g., very high in the memory factors but very poor in verbal
comprehension. In addition to [...] and appraisals of students’ abilities,
teachers should be on guard to look at the differential abilities of each
student. [...] are better at certain kinds of intellectual tasks than [...]
“wrong” when this occurs. Because a [...] several intellectual
dimensions, e.g., the verbal [...] dimensions, teachers often wrongly
conclude that they should be [...] high in others, such as general
reasoning. Some unevenness [...] of ability is to be expected of all
students. Consequently, rather than ask the oversimplified question of
how bright is Ted [...], it would be more meaningful to ask: How high is
he in verbal comprehension? What is his level of verbal fluency? How
does he do on [...]? How well does he perform in numerical computation?
To have to consider these somewhat different dimensions of intellectual
activity, rather than talk about one over-all dimension of intelligence,
complicates our work considerably, but it brings us closer to a faithful
map of human ability.


Multifactor Batteries 


Instead of employing only tests of general ability, it would be pos-
sible to employ multifactor batteries containing tests for some of the 
important factors discussed in previous sections of this chapter. There 
are two reasons why multifactor batteries are not used more often than 
they are in school settings. First, as was mentioned previously, in the 
elementary grades, abilities of students tend to be more general, and 
one general test, such as those that will be discussed in the next chap- 
ter, goes a long way toward indicating the student's ability. The second 
reason is that a thorough multifactor battery is very difficult to con- 
struct and expensive and time consuming to apply. In order to meas- 
ure a factor, such as the general reasoning factor, it is necessary to 
have at least two tests relating to the factor. If the battery contains
measures of as many as six factors, this makes for a total of twelve
tests. Because each test would require a minimum of twenty or thirty 
minutes, the total battery would consume a great deal of time. Scoring 
and interpretation also would require considerable time on the part of 
teachers and school counselors. When more funds are available for 
testing, and when better multifactor batteries are made for the
elementary grades, we can expect a wider use of such tests at these early
levels. At the present time, few uses are made of multifactor batteries
until the high school level.
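
The time estimate in the paragraph above can be made concrete with a little
arithmetic. The sketch below uses only the figures given in the text (six
factors, two tests per factor, twenty to thirty minutes per test); the
function name and its parameters are illustrative assumptions, not part of
any published battery.

```python
def battery_time(n_factors, tests_per_factor, minutes_per_test):
    """Return (number of tests, total testing minutes) for a battery."""
    n_tests = n_factors * tests_per_factor
    return n_tests, n_tests * minutes_per_test

# Figures from the text: six factors, two tests each, 20 to 30 minutes per test.
low = battery_time(6, 2, 20)
high = battery_time(6, 2, 30)
print(low, high)  # (12, 240) (12, 360)
```

Under these assumptions the battery requires from four to six hours of
testing time, before any allowance for directions, rest periods, or scoring.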

In the following section will be described a multifactor battery that 
is frequently used with high school students. For each type of test to 
be discussed in this and the following chapters, one or several examples 
will be presented. To prevent the book from becoming a dreary cata- 
logue of tests and their individual characteristics, no effort will be 
made to give a comprehensive review of all the good tests in each 
area. For example, there are at least a dozen good group tests of in- 
telligence, and to describe them all would make for very dull reading. 
In addition to the examples which will be described in the text, other 
good tests relating to each area of measurement will be described in 
the Appendix.

The Differential Aptitude Tests. The Differential Aptitude Tests 
(DAT) were developed by the Psychological Corporation (9), prin- 
cipally for use in the vocational and educational guidance of high 
school students. Although the tests were intended for use with students
in grades 8 through 12, they have a sufficient range of item difficulty
to be used with most adult groups. The tests were not developed
directly out of factor-analytic work but were composed in such
a way as to incorporate some of the major findings from factor analysis.

The subtests of the DAT are as follows: 

1. Verbal reasoning. The items consist of verbal analogies, in 
which the reasoning component rather than the difficulty of words is 
emphasized. The test is concerned both with reasoning factors and 
with verbal comprehension. 

2. Numerical ability. The test covers a wide range of numerical 
computations. It should be a good measure of the numerical computa- 
tion factor as it was previously described. 

3. Abstract reasoning. This test differs from the verbal reasoning 
test in that all the problems deal with abstract patterns. 

4. Spatial relations. The test items concern the individual's ability 
to imagine how objects would look if they were rotated in space and 
to visualize a three-dimensional object from a two-dimensional pattern. 

5. Mechanical reasoning. The test items consist of pictures which 
portray mechanical problems. The student is asked questions about 
each picture. 

6. Clerical speed and accuracy. The test is modeled closely after
the perceptual speed factor. The student is required to find identical
sets of numbers and figures.

7. Language usage. This test is more concerned with acquired
knowledge, or achievement, than with a specific aptitude. The two
parts of the test concern spelling ability and grammatical usage in
sentences.

Sample items:


Verbal Reasoning

______ is to one as second is to ______

1. middle   2. queen   3. rain   4. first
A. two   B. fire   C. object   D. hill

______ is to night as breakfast is to ______

1. flow   2. gentle   3. supper   4. door
A. include   B. morning   C. enjoy   D. corner

Pick out words which will fill the blanks so that the sentences will be true and
sensible. For the first blank, pick out one of the numbered words. For the second
blank, pick out one of the lettered words.


Figure 11-1. Sample items from the Differential Aptitude Tests. (Reproduced
by permission. Copyright 1947, 1961, The Psychological Corporation, New York,
N.Y. All rights reserved.)


* Reproduced by permission. Copyright 1947, 1961, The Psychological
Corporation, New York, N.Y. All rights reserved.


Abstract Reasoning


[Two rows of Problem Figures and Answer Figures; the figures are not reproduced.]


Each row consists of four figures that are called Problem Figures and five called
Answer Figures. The four Problem Figures make a series. You are to find which
of the Answer Figures would be the next, or fifth one, in the series.


Space Relations


The test consists of forty patterns which can be folded into figures. For each
pattern, five figures are shown. You are to decide which of these figures can be
made from the pattern shown.


Numerical Ability


Add       13      A  14
          12      B  25
                  C  16
                  D  59
                  E  none of these

Subtract  30      A  15
          20      B  26
                  C  16
                  D  8
                  E  none of these


Choose the correct answer.

Figure 11-1. Continued


Mechanical Reasoning 


Which man has the
heavier load? (If equal,
mark C.)


Which weighs more? 
(If equal, mark C.) 


Language Usage Part I: Spelling 
man 
gurl 
catt 
dog 


Indicate whether or not each word is spelled correctly. 


Language Usage Part II: Sentences 
Ain't we / going to the / office / next week / at all. 


A B C D E 

The test consists of a series of sentences, each divided into five parts, lettered
A, B, C, D, and E. You are to look at each and decide which of the lettered
parts have errors in grammar, punctuation, or spelling.


Figure 11-1. Continued


Clerical speed and accuracy 


Test Items SAMPLE OF ANSWER SHEET


AD AE AF AE 


BA Ba Bb 
B7 7B AB 
bA BA bB 
33 B3 BB 


Each test item consists of five letter and number combinations. You are to look 
at the one combination that is underlined and find the same combination on the 
answer sheet. (The examples above are all correctly done.) 


Figure 11-1. Concluded 


Evaluation of the DAT. Although the DAT was not derived di- 
rectly from factor analysis, the tests represent a collection of reason- 
ably independent measures which range broadly over those factors 
most directly related to school achievement. The tests were con- 
structed primarily for school programs, and it is doubtful that they
would meet with the same level of success in other testing situations.
The correlations among the tests vary considerably, going from .06 to 
as high as .67. The intercorrelations run about .50 on the average. Al- 
though it is desirable to have lower correlations, they are not so high 


Table 11-3: Median Correlations of Differential Aptitude Test Scores
with School Grades*
(Adapted from G. K. Bennett et al., 9)


Test   English   Math.   Science   Soc. stud., history   Languages   Typing   Shorthand
Verbal Reasoning (VR) 50 39 54 50 pe 15 s 
Numerical Ability (VA) 48 50 51 - 77 55 A 
Abstract Reasoning (42) 36 35 ae 55 T “A z 
Space Relations (SR) 27 32 36 ry 5 5 1 
Clachantenl Reasoning (MR) 24 22 38 2 7 
Jerical Speed s “ouracy ' . 
1082 oo and Accuracy m p 20 m ” m x 
Spellin 4 2 36 36 31 26 55 
elling (Spell.) 9 : : 
Seutenre enh: j 32 36 is 46 % 30 %9 


* Decimal points omitted. 


as to prevent the battery from functioning as a measure of differential
aptitudes.

The reliabilities of the tests are generally high, ranging from [...] to
.93 for all except the mechanical comprehension test. The reliability
for men on the mechanical comprehension test is sufficiently high, with
a mean coefficient of .85, but the reliability for women is only [...].

A particular point in favor of the DAT is the large amount of re-
search that went into the standardization and validation of the instru-


Table 11-4: Percentile Equivalents of Average Scores on the DAT
for Men in Various Educational and Occupational Groups

(Adapted from G. K. Bennett et al., 9)


Percentiles


Group No. VR NA AR SR MR CSA Spell. Sent.


Degree-seeking st udents: 


Premedical 24 88 8 81 72 7 77 90 90 
Science (biology, chem- 72 
istry, and math.) 25 81 85 60 67 68 74 81 
Engineering (includes . 74 
architectural) 70 80 86 go 81 82 67 68 
Liberal arts (includes s1 
prelaw) a W 75 7 6 84 75 78 68 
Business administration 64 72 73 61 63 60 71 67 
Education (includes 66 
Physical education) 25 68 66 5g 57 48 68 67 
Various: predental, 


agricultural, ete. 30 


Nondegree students in 

two-year schools: 

Business, technical, 
arts, ete. 


fine 


72 68 60 54 6l 
Employed: 5 49 
Salesmen 23 56 53 53 52 57 4% 50 
Clerks: general office 2 53 
work 55 45 42 32 m Al = a 
Mechanical, electrical, 
and building trades 66 
Various Skilled: 
baker, ete, 
Various unskilled: truck 
driver, laborer, ete. 85 


butcher, ge 
26 47 37 43 4l 49 51 52 


35 30 36 42 49 37 35 36 
35 * s 

Military service 
Unclassified: 

No consistent Work or 
school record 58 53 


129 46 42 46 51 50 49 46 4l 
£ 46 i 




ment. The norms are based on the testing of 47,000 students in grades 
8 to 12 scattered widely over the United States. Because sizable sex 
differences were found on some of the tests, separate norms are given 
for boys and girls. 

Thousands of correlations between DAT scores and various criteria
have been computed. Some correlations between test scores and class
grades are shown in Table 11-3. A follow-up study was undertaken of
students two years after completing high school. The results are shown
in Table 11-4, where scores for the total group are broken down in
terms of subsequent occupation or college specialty. Some of the
groups in Table 11-4 are represented by only several dozen
individuals, and, consequently, the results should be considered tentative.
More research of this kind will provide the DAT and other similar test
batteries with a firmer basis for vocational counseling. The DAT is a
model of careful test design, practicality, thoroughness of research,
and frankness of reporting.


Summary 


This chapter was concerned with the “generality” of intelligence. 
Whereas those who construct and employ “intelligence tests” operate 
on the assumption that there is only one factor, or yardstick, of intelli- 
gence, others have claimed that numerous factors underlie intelligent 
behavior. Essentially the question boils down to that of the sizes of cor- 
relations found among different types of mental tests. If such correla- 
tions all are high, it supports the position that intelligence is highly 
general; but if some of the correlations are high and others are low, 
and if definite “clusters” can be found among the correlations, it sup- 
ports the multifactor position. The research evidence shows that there 
is something to be said for both positions. All correlations among men- 
tal tests tend to be positive, which supports the generalist's point of 
view. On the other hand, there are definite clusterings of correlations, 
which indicates that it is meaningful to talk about different factors of 
intellect. 
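
The two patterns just described, correlations that are all positive and
correlations that form clusters, can be illustrated with a small sketch. The
four test names and the scores for six students are invented for this
example; they are not data from any study cited in this chapter.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented scores (ranks 1 to 6) for six students on four hypothetical tests.
scores = {
    "vocabulary":    [1, 2, 3, 4, 5, 6],
    "reading":       [2, 1, 3, 4, 6, 5],
    "figure_series": [4, 2, 1, 6, 3, 5],
    "spatial":       [4, 1, 2, 6, 3, 5],
}
# Assumed grouping of the tests into two clusters.
clusters = {"vocabulary": "verbal", "reading": "verbal",
            "figure_series": "nonverbal", "spatial": "nonverbal"}

within, across = [], []
for a, b in combinations(scores, 2):
    r = pearson(scores[a], scores[b])
    (within if clusters[a] == clusters[b] else across).append(r)
    print(f"{a:14s} {b:14s} {r:5.2f}")

all_positive = all(r > 0 for r in within + across)  # the generalist's point
clustered = min(within) > max(across)               # the multifactor point
print(all_positive, clustered)
```

In these made-up data every correlation is positive, yet tests within a
cluster correlate much more highly with each other than with tests in the
other cluster, so both positions receive some support at once.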

Apparently there is no end to the number of factors of intellect that 
can be found in mental tests. However, no more than ten of them 
seem to play important parts in real life, and only four or five of them 
play any important part in successful performance in elementary and 
secondary school. Some of the other factors play important parts in 
vocational pursuits and in college training. Although not enough re-
search has been done to speak with sureness, it is quite likely that
some of the factors which presently are of little practical use may be 
important for high-level accomplishment in art, science, and other 
professional activities. 


Except for the last two years of high school, little use is made of
multifactor test batteries, for three reasons. First, we still have a long
way to go before adequate multifactor batteries are available for the
elementary levels. Second, those batteries that are available are
somewhat tedious to use and interpret. Third, at least for the first four or
five grades of elementary school, success in schoolwork depends very
heavily on only one factor of intellect—verbal comprehension; conse- 
quently, one test which is heavily concerned with verbal comprehen- 
sion goes a long way toward serving the need. Because tests of general 
intelligence lean heavily on verbal comprehension, they serve very 
well for that purpose. However, as better multifactor batteries are
constructed, and as educators become better acquainted with their 
advantages, more use probably will be made of the batteries. 

A discussion of the generality of intellect should provide several
important lessons to teachers. Intelligence is not perfectly general, and
teachers should look for the ups-and-downs in abilities within each
student. Whereas a student may be quite capable over-all, there may
be areas of scholastic activity in which he is only average or even well
below average. On the other hand, a student who appears to be below
average in general may be well above average in some scholastic
activities. Another lesson that teachers should learn is that it is easy to
pay too much attention to some factors of intellect that have very little
to do with long-range success in school and in later work. This is
particularly so for some of the memory factors and for the numerical
computation factor. Although these relate to how well children master
some of the simple skills in elementary school, they have little to do
with how well students perform later in school, and even less to do
with high-level accomplishment in life.

Suggested Additional Readings 


Cronbach, L. J. Essentials of psychological testing. (2nd ed.) New York: Harper
& Row, 1960, chaps. 9, 10.

Fruchter, B. Introduction to factor analysis. Princeton, N.J.: Van Nostrand, 1954.

Guilford, J. P. The structure of intellect. Psychol. Bull., 1956, 53, 267-293.

Nunnally, J. C. Tests and measurements. New York: McGraw-Hill, 1959, 164-174.


Tests of 
General Ability 


Parents, mostly mothers, are making the supreme sacrifice in behalf 
of togetherness with their sons by attending the monthly cub scout 
pack meeting. While there is a lull in the ceremonies in preparation 
for awarding badges, insignia, and many special citations, mothers are 
talking, as they often are, about their children. Mrs. Crankston wonders 
if the school is doing enough for “gifted children.” Her Charlie has 
been tested by a psychologist and, so she says, “has a high IQ, but
they are not giving him any special instruction at Belmont.” At the
mention of “IQ” the other mothers assume a look of awe, as though
something sacred had been broached.

No other product of psychology has had the same impact on the 
public as have “intelligence tests.” Their results are too often reported 
as though they were infallible guides to all that is wise and good. Such 
tests do serve many purposes well, but only if they are expertly ad- 
ministered, and only if they are interpreted in the light of the limita- 
tions of the tests, the personality of the child, and the special abilities 
which the child possesses. The purpose of this chapter is to explain
the composition, use, and proper interpretation of “intelligence tests.”

The term “intelligence tests” has been placed in quotes, and used
with some misgiving, because it implies more than can be expected
from any test. Consequently, the first issue to be discussed will be that
of what the so-called “intelligence tests” actually contain.


The Content of “Intelligence Tests”


In the previous chapter it was shown that there are a number of
semi-independent kinds, or factors, of intelligence, and it was said that
it is somewhat misleading to lump all these together in one over-all
measure. That is what the intelligence tests do: they sample from the
content of a number of factors of intellect. In other words, they seek to
measure how intelligent a person is in general (or on the average),
without specifying the particular ways in which he is more or less
capable. Because they sample content from a number of factors, it is
more appropriate to refer to them as measures of general ability rather
than by the grandiose name of intelligence tests.

[...] tests of general ability have been very useful in
the past, and will probably continue to be for some time to come, [...]
that they measure a conglomerate of separable [...].
[...] is one reason why it has been necessary
[...] of general ability. It is very difficult [...]
which are needed to measure even such [...]
described in the previous chapter. [...]
to use and time consuming to [...]. [...] multifactor batteries have been
obtained, they usually are [...] difficult for most persons to interpret.
[...] sensible to use measures of general
ability is that they are not as conglomerate as might be thought.


Rather than spread [...] content evenly [...] different factors,
they tend to concentrate on only several [...].
In particular, they tend to concentrate on verbal comprehension,
general reasoning, [...] computation. To a lesser extent [...]
sample items from memory, perceptual, and spatial factors.
The factors which predominate in most tests of
general ability are the ones that intuitively “look” important,
and as studies have shown, they are [...]
in schoolwork. These factors also tend to correlate well with one
another. For these reasons, tests of general ability primarily measure
some of the more “important” factors rather than a [...]
hodgepodge of mental functions.

The factor content of most tests of general ability was not specifically
planned. Rather, each test constructor [...] that looked [...]
they measured “intelligence.” In [...] factor-analytic studies, it [...]
found that most tests tend to measure the [...] factors, and,
consequently, most of them correlate highly with one another. Some
examples of the types of items that tend to predominate in measures of
general ability are shown as follows:


VOCABULARY:


Indignant means most nearly the same as 
a. poor 

b. lazy 

c. angry 

d. spiteful 

VERBAL RELATIONS: 


Ship is to sail as automobile is to — 


VERBAL MEANING:


What is the meaning of the saying “penny-wise and pound-foolish”?


FIGURAL RELATIONS:


Which one of the following comes next?

[Figures not reproduced.]


Which one does not belong with the others?

[Figures not reproduced.]


Which small figure correctly fills the missing part?

[Figures not reproduced.]


ARITHMETIC REASONING:

A boy has twelve apples. He gives half of these to his mother, and half
of [...] to a friend. How many apples does he have left for himself?

a. 8   b. 2   c. [...]   d. 3


GENERAL REASONING:


A boy goes to a stream with one 2-quart bucket and one 3-quart bucket.
How can he arrange it so that he will take back exactly 4 quarts of water?


GENERAL INFORMATION: 


What time is it when the minute hand is pointing straight up and the
hour hand is pointing straight down?


What is the smallest state in the United States?


PRACTICAL JUDGMENT:


Suppose that you are walking by yourself and you see a house on fire.
No one else sees the fire. What should you do?


ABSURDITIES:

Billy’s mother said “You are going to be late for school.” To make sure
that he got there on time, he set the hands on the kitchen clock back one
hour. What is foolish about that?


OBJECT ASSEMBLY: 


Put the pieces together to make a square.

[Figure not reproduced.]


Achievement and Aptitude 


Teacher-made tests and standardized achievement tests are intended
to measure how much a student has learned in school. Tests of general
ability are intended to measure how much the student can learn in the
[...] ideal conditions prevail. Another way of saying it is that tests
of general ability are intended to measure the capacity to learn rather
than, as with achievement tests, how much has been learned up to
a particular point in time.

The distinction between aptitude and achievement is commonplace
in daily life. We say such things as “If he wanted to, he could be an
excellent student.” “Even though he has only limited ability, he makes
[...] grades by studying night and day.” “His real ability is hidden by
[...] schools which he previously attended.”

It would indeed be helpful if we had pure measures of aptitude.
[...] would be particularly helpful in spotting children who would do
[...] better in school if they were given better training or if their
[...] and home environments could be improved. Unfortunately,
[...] only partially accomplished by [...]
tests of general ability. At the present time we do not know how
[...] other than by [...] child [...]
a particular point in time. There is no way to get inside
the child to obtain a pure measure of how much he could
accomplish in ideal situations. Rather, what the tests of general ability tend
to do is measure relatively abstract ability, ability that is not as
dependent on specific types of school instruction as is required by
achievement tests.

A child probably would not be able to answer questions about the
mineral resources of Africa or about the United States Constitution
unless he has had good instruction in those topics in school and unless
he has made a concentrated effort to study the materials. Such questions
are closely bound to the quality of instruction and the energetic
study of the student. In contrast, some of the types of items used on
tests of general ability are not highly related to the richness of the
student's school and home environment. For example, the figural
relations items shown previously do not obviously relate to anything that
was learned in the home or in school.

The value of most tests of general ability is that they contain items
not as directly related to school and home environment as are those
found on most achievement tests. The effort in measures of general
ability is to use questions that any intelligent person could answer
from his daily experience rather than ones that require specific types
of instruction. This is true even of items measuring vocabulary and
general information. Efforts are usually made to employ words and
facts that largely could be learned by any intelligent person.




Even with the best of efforts, it presently is not possible to measure 
aptitude completely apart from achievement. The same factors that 
make for achievement in school also, to a large extent, make for suc- 
cess on tests of general ability. The two types of tests usually corre- 
late highly, usually around .70, with the exact size of the relationship 
depending on the type of school and the year in school. The difference 
is more a matter of degree than a matter of kind. Achievement tests 
are relatively more concerned with present accomplishment; aptitude 
tests are relatively more concerned with abstract ability. 

When students’ scores are compared on achievement tests and tests 
of general ability, the differences usually are not large, and many of 
the apparent differences are due only to measurement error. However, 
when such differences are quite large, they provide important diag- 
nostic information about students. It is in these instances that the 
major usefulness of tests of general ability is shown. 
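
To make the role of measurement error concrete, the following is a minimal
sketch in Python of one conventional way to judge whether a difference
between an achievement score and a general ability score is larger than
error alone would be likely to produce. It assumes that both scores are
expressed on the same standard-score scale, that the errors on the two
tests are independent, and that the reliability of each test is known. The
reliabilities of .90 and the standard deviation of 16 are illustrative
assumptions, not values taken from any particular pair of tests.

```python
import math


def standard_error_of_difference(sd, reliability_a, reliability_b):
    """Standard error of the difference between two scores on one scale.

    Assumes both tests share the standard deviation `sd` and that their
    errors of measurement are independent of each other.
    """
    return sd * math.sqrt(2.0 - reliability_a - reliability_b)


def difference_exceeds_error(score_a, score_b, sd,
                             reliability_a, reliability_b, z=1.96):
    """True if the difference is larger than about 95 percent of the
    differences that measurement error alone would produce."""
    se_diff = standard_error_of_difference(sd, reliability_a, reliability_b)
    return abs(score_a - score_b) > z * se_diff


# Hypothetical student: ability score 120, achievement score 100.
se = standard_error_of_difference(16, 0.90, 0.90)
print(round(se, 2))  # about 7.16 score points
print(difference_exceeds_error(120, 100, 16, 0.90, 0.90))
```

Under these assumptions a difference must exceed roughly 14 points before
it is worth interpreting, which is one way of seeing why small differences
between the two kinds of scores carry little diagnostic information.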


Individual Versus Group Tests 


Tests can be given to each individual separately, or a group of indi- 
viduals can be tested at one time. Although this is a consideration for 
tests of all kinds, it is a particularly important issue in the use of gen- 
eral intelligence tests. The multifactor batteries are almost always 
group tests. There is considerable competition between individual and 
group tests of general intelligence. Numerous articles have appeared 
in the psychological literature arguing about whether group or indi- 
vidual tests should be used with particular age groups. 

Individual Tests. The first practical general intelligence test, the 
Binet-Simon scale, was administered individually. This was necessary 
because the subjects were young children. The individual test requires 
a highly experienced examiner. The examiner is in essence a part of 
the standardized testing procedure, and he must standardize his own 
treatment of the child to conform to established methods. 

Many of the items on individual tests cannot be scored unambigu- 
ously as right or wrong. Instead there may be a number of acceptable 
responses, and different scores are often required to indicate the de- 
gree of correctness. The better-established individual tests go to con- 
siderable lengths to specify just what will be considered a correct re- 
sponse and how much credit should be given for a response. The ex- 
aminer must follow the established scoring procedures meticulously 
and not permit his subjective judgment of a child to influence the test 
results. Any idiosyncrasies in scoring will make the test less reliable.

Group Tests. The first practical group tests of general intelligence 
were developed for the Armed Forces during the First World War. 
The number of men to be tested required a quick and economical
measurement device, and the individual test was unsuited for that
purpose.

Most group tests are sufficiently self-explanatory so that the test
examiner need have little or no specialized knowledge of testing
procedures. The test forms are simply passed out to subjects; either they
are allowed to work at their chosen rate or the examiner directs the
subjects when to start and stop.

Comparison of Individual and Group Tests. Although the following 
rules are not precisely correct for all group and individual tests, the 
rules are sufficiently general to offer a reasonable basis for choosing 
between the two kinds of tests in most situations. 

1. Individual tests are required with young children. Starting at the
earliest age, there are no group tests that can be used with infants.
Each infant must be examined separately, and any effort at standardization
of procedures is difficult. With preschool children, it is usually
necessary to use individual tests. Young children either cannot read at
all or they lack the reading ability to take the self-explanatory group
forms. As an additional factor, young children are highly distractible,
and it is all that the expert examiner can do to keep one child working
at the test materials. Young children are often not motivated to do well
on tests, and it is only through the examiner's careful, but standardized,
encouragement that a meaningful measure can be obtained.
2. Group tests more frequently are used for testing “normal” adults.
The well-standardized group tests prove to be as good predictors as
the individual tests when working with most teen-age and adult
groups. Although the evidence is clear that individual tests are better
predictors for young children and that group tests do as well with
adolescents or older, it is not certain which kind of test is generally
more valid with the in-between age group.

Adolescents and older persons are usually motivated and attentive
enough to manifest a meaningful score on a group test. Also, they
probably are less embarrassed by the group testing situation than they
would be in the face-to-face individual testing situation.

Because group tests tend to be as valid as individual tests when
administered to adults, they are much to be preferred in practical
work. Group tests are much less expensive and time consuming. It
takes no more time to administer and score twenty group tests than
it does to administer and score one individual test.

3. Individual tests are often useful in clinical settings. The school
psychologist often can learn considerably more from the individual
test than the subject’s score would indicate. The child who appears
dull in the classroom may be only hard of hearing. Another child may
do poorly in schoolwork because he wants to do poorly, giving wrong
answers when he knows what is correct. The older adult may appear
to be demented because he is discouraged and withdrawn. These
things probably would not be found in group testing situations, but an
experienced clinician can often use the individual testing situation to
diagnose why an individual is performing poorly.

4. Group tests usually are easier to construct than individual tests. 
For the person who plans to construct a predictor test, an individual 
test should not be undertaken unless measurement specialists are avail- 
able and it is planned to spend considerable time and money in test 
construction. The difficulties in constructing test materials, standardiz- 
ing the test, and particularly the setting of instructions for administra- 
tion and scoring make the development of an individual test a time- 
consuming, expensive job. Training test examiners for the individual 
tests creates additional expense. 


Verbal Versus Performance Tests 


There has been some confusion about the difference between “ver- 
bal” tests and “performance” tests, and the distinction is itself some- 
what misleading. The following outline of verbal components in tests 
is offered as a basis for discussing test content: 


VERBAL REQUIREMENTS: 


1. Understand spoken language 
2. Understand written language 
3. Speak language 
4. Write language 
5. Verbal comprehension factor 


There are many different combinations of these requirements in par- 
ticular tests. A test may require the first four items in the list, but deal 
with language at so simple a level that very little ability in verbal 
comprehension is required to obtain a high score. It is possible to make 
up a test in which almost none of the five aspects is required. This can 
be done by giving the test instructions in pantomime and using test 
materials that require neither written nor spoken responses. It is also 
possible to compose a test in which none of the first four aspects is 
present and the fifth aspect, verbal comprehension, is a cardinal re- 
quirement. If the test requires the child to manipulate abstract sym- 
bols or to deal with pictures, this will tap, in part, the verbal compre- 
hension factor. Each test should be examined in terms of its combina- 
tion of verbal requirements rather than simply classified as “verbal” or 
“nonverbal.” 

Another distinction can be made among tests in terms of the way in 
which responses are made: 




NATURE OF RESPONSE: 


1. Symbolic response. The subject indicates the correct answer
either through the use of language or by marking one of a number of
choices. The symbolic response might be made with respect to objects
rather than printed materials, although this is usually not done.

2. Manipulative response (performance). The subject is required
to handle objects in such a way as to complete a specified product. 
The product may be anything from a completely finished piece of 
machinery to the arrangement of a set of blocks. 


Some items are not clearly differentiated in terms of the two kinds of 
responses. For example, in maze tracing, the child is required to coor- 
dinate the pencil and move to the goal—the response is as symbolic as 
it is manipulative. 

It has been the custom to call instruments “performance” tests if
they deemphasize language requirements, employ three-dimensional
materials, and require manipulative responses. Because of these
components, the performance tests usually measure motor coordination,
speed, and perceptual and spatial factors. Instruments are usually
referred to as “verbal” tests if they are printed forms, emphasize
verbal comprehension, and require symbolic responses. Because of the
ease with which certain kinds of test materials can be placed on
printed forms, the verbal tests tend to measure verbal comprehension,
numerical computation, and the reasoning factors. There is no clear-cut
separation between the factors found in verbal and performance
tests, but there is a tendency for different factors to arise in the two
kinds of materials.

In addition to the two extreme types of measures, one extreme type
heavily involved in verbal requirements and the other extreme type
almost exclusively manipulative, there is a third important type of test,
which is variously referred to as “nonlanguage,” “culture-free,” and
“culture-fair.” In this third type of test, the student is required neither
to use and understand language nor to manipulate three-dimensional
objects. Rather, the test items consist of symbolic responses (multiple
choice) to relationships among figures and designs. Typical of these
types of items are the examples given earlier of items involving figural
relations. Such tests have a very important place. They avoid complete
dependence on verbal ability, and they apparently measure more
important intellectual functions than do performance tests.

The following sections will describe some typical group and individual
tests of general ability. Representative tests will be described for
different age groups. As is the rule throughout the book, no effort will
be made to present an exhaustive (and, consequently, unreadable)
list of all the good measures available. Mention is made of other good
tests of general ability in the Appendix. After representative tests are
described, some of the major research findings about general ability
will be discussed, and some recommendations will be given for the use
of tests of general ability in elementary and secondary schools.


The Binet Test and Its Followers 


As will be recalled from the previous chapter, Alfred Binet pioneered 
in the theory and practice of measuring general ability. His immediate 
practical problem was to construct tests to be used with “problem 
children” in French elementary schools. Among those children who 
were failing in school, some method was needed for distinguishing 
those who lacked the capacity to learn from those who might profit by 
special instruction. Binet’s point of view was that for practical pur- 
poses, it would not be feasible to measure intelligence in terms of the 
many constituent factors involved, but, rather, that it would be neces- 
sary to measure general ability by its end products. Consequently, in 
his tests Binet emphasized the ability to make correct judgments, the 
ability to solve problems, and the ability to understand words and 
written material. 

Working with Simon, in 1905 Binet (10) produced the first crude 
measure of general ability. The test consisted of thirty items, graded 
in difficulty. The first fifteen items in the list are presented as follows: 


1. Follow a lighted match with head and eyes. 
2. Grasp a cube placed on the palm. 
3. Grasp a cube held in line of vision. 
4. Make a choice between pieces of wood and chocolate. 
5. Unwrap chocolate from paper. 
6. Execute simple orders. 
7. Touch head, nose, ear, cap, key, and string. 
8. Point to objects which experimenter names in picture. 
9. Name objects pointed out in a picture. 
10. Judge which of two lines is the longer. 
11. Repeat immediately three digits read by examiner. 
12. Judge which of two weights is heavier. 
13. Solve problems that embody novel, ambiguous, or contradictory
solutions.
14. Define house, horse, fork, and mamma.
15. Repeat sentence of fifteen words after a single hearing. 


The list of problems was tried out on about fifty children, and provi- 
sional norms were established. It was found that the first five items 
could be passed by idiots and normal two-year-olds. Most three-year- 
olds could go no further than about the ninth item. Most five-year-olds 
could go no further than about the fourteenth item. 




The first test obviously was only a rough beginning to the measure- 
ment of general ability, but it constituted a very important first step. 
Binet’s conceptions of how general ability should be measured and the 
types of items which should be used have dominated most tests of 
general ability to this day. 

Binet later made revisions of his test, employed more items, gathered 
responses from larger groups of children, and developed more de- 
pendable norms. However, the center of activity in the development 
and use of measures of intelligence soon shifted to America. During
the first decade of this century, the Binet-Simon tests became very 
popular in this country. Instruments of this kind were needed in the 
care of the feebleminded, in educational research, and in understand- 
ing juvenile delinquency. Several translations into English were made 
of the scales, and efforts were made to broaden and further standardize 
the tests. 

The Terman and Merrill Revisions. The most extensive revisions
of the Binet-Simon scales were made first by Terman in 1916 and then
by Terman and Merrill in 1937 and 1960 (77). Because their work
was done at Stanford University, their revisions were named
Stanford-Binet tests. Their work has so extensively revised and extended
Binet's tests that it is only out of respect to Binet that the current form
of the test still bears his name. The major improvements which have been
made are (a) the tryout of a large number of items, (b) statistical
analysis to obtain the most effective items, (c) careful writing of
instructions for administration and scoring, and (d) gathering of norms
from large numbers of representative children and adults. Since the 1916
revision, the Stanford-Binet series of tests has been used very widely
in this country and has been translated into many other languages.

Construction of the Stanford-Binet Tests. The Stanford-Binet
contains six subtests and one alternate (in case something goes wrong in
the testing) at half-year intervals from two to five years of age, at
yearly intervals from five to fourteen, and at four levels of adult
performance. (At the “average adult” level there are eight rather than six
subtests.) The items at each level were carefully selected according
to three principles. First, all items were selected because they looked
as if they measured “intelligence” (see previous examples). Second,
all items retained for the scale should correlate highly with
chronological age. Although we may argue about the meaning of
“intelligence,” all will agree that it grows with the child. Growth with
age is not the only desirable standard for selecting items, however.
Length of foot also grows with age, and it goes without saying that
no one would consider that as a measure of intelligence. With
respect to the standard of age differentiation, a suitable item for the
seven-year subtest would be one where few of the six-year-olds get the
correct answer, over half of the seven-year-olds get the correct answer,
and most of the eight-year-olds get the correct answer. Needless
to say, it is very difficult to find items that meet the standard.
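
The standard of age differentiation can be sketched in a few lines of
Python. The example below is only an illustration of the pattern described
above: the responses are invented, and the cutoffs used for “few” and
“most” (35 and 75 percent) are assumptions chosen for the example, not
values used in constructing the Stanford-Binet.

```python
def proportion_passing(responses):
    """Fraction of children in one age group who pass the item (1 = pass)."""
    return sum(responses) / len(responses)


def suits_age_level(younger, target, older, few=0.35, most=0.75):
    """Check the age-differentiation pattern for a single item.

    `few` and `most` are hypothetical cutoffs: few of the younger group
    pass, over half of the target group pass, most of the older group pass.
    """
    p_younger = proportion_passing(younger)
    p_target = proportion_passing(target)
    p_older = proportion_passing(older)
    return p_younger <= few and p_target > 0.5 and p_older >= most


# Invented responses for one candidate item at the seven-year level.
six_year_olds = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
seven_year_olds = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
eight_year_olds = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

print(suits_age_level(six_year_olds, seven_year_olds, eight_year_olds))
```

An item that nearly all six-year-olds already pass, or that only half of
the eight-year-olds pass, would fail the check and be dropped or moved to
another age level.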

The third standard which was applied to the items was that of 
homogeneity. For this standard, each item was correlated with the 
total test, which gives an index of how well the individual items meas- 
ure the same thing as the whole test. This is a sensible standard in 
constructing a measure of general ability, because even though the 
total test may not be a perfect measure, it is logically a better measure 
than any item taken separately. By applying these procedures, the final 
collection of items (a) consists of problems that intuitively relate to 
general ability, (b) differentiates well between adjacent age groups, 
and (c) is relatively homogeneous in content.
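
The homogeneity standard can be illustrated with a short Python sketch that
correlates each item with the total score. The score matrix is invented for
the example, and the total used here includes the item being examined, which
is the simplest form of the index and only one of several ways it can be
computed.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def item_total_correlations(score_matrix):
    """Correlate each item (column) with the total score (row sum).

    `score_matrix` has one row per child and one column per item,
    with 1 for a pass and 0 for a failure.
    """
    totals = [sum(row) for row in score_matrix]
    n_items = len(score_matrix[0])
    return [
        pearson_r([row[j] for row in score_matrix], totals)
        for j in range(n_items)
    ]


# Invented data: six children, four items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

for number, r in enumerate(item_total_correlations(scores), start=1):
    print(f"item {number}: r = {r:.2f}")
```

In this invented set the fourth item correlates negatively with the total,
so it would be a candidate for removal on grounds of homogeneity.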

The 1960 Revision. The most recent revision of the Stanford-Binet 
(77) consisted largely of recombining and making minor changes in 
items used in previous forms. The 1937 revision consisted of two alter- 
nate forms, called Forms L and M, respectively. Having an alternate 
form available made it possible to retest children after short periods 
of time without having the results affected by memory. However, little 
use was made of Form M. Consequently, in the 1960 revision, the best 
items were selected from both the older Forms L and M, and the pres- 
ent form is referred to as L-M. Other than for the combining of items 
from the two older forms, the only other major change was obtaining 
up-to-date norms. Because of the similarity between the present test 
and those published in 1937, many of the findings with respect to the 
earlier tests probably still hold true. 

Content of the Stanford-Binet. The present test bears many of the 
earmarks of the original Binet test. A knowledge of words and the 
comprehension of written material play a predominant part in the 
scale, particularly at the upper age levels. Although a number of per- 
formance items appear at earlier age levels, these primarily concern the 
child’s recognition and use of common objects. A few items concerning 
spatial and perceptual abilities are found at different age levels. Many 
of the items relate to one or more of the reasoning factors discussed in 
the previous chapter. Three types of memory items appear at one or 
more age levels, including serial memory of digits, paragraph memory, 
and memory of geometric designs. To illustrate the nature of the 
scale, the items at four different age levels are described as follows: 


TWO-YEAR LEVEL: 


1. Three-hole form board. The child is shown a form board con- 
taining a cut-out square, circle, and triangle. The pieces are removed, 
and the child is asked to put them in their places. The child receives 
credit if all three objects are put in place. 

2. Delayed response. The child is shown three small boxes and a
toy cat. The examiner says “Look, I am going to hide the kitty and
see if you can find it.” While the child is watching, the toy cat is
placed under the middle box. Then a screen is placed in front of the
boxes for about ten seconds. When the screen is withdrawn, the child
is asked to choose the box containing the cat. The procedure is repeated
putting the cat under the box on the right and then under the
box on the left. The child is given credit if two out of three first
selections are correct.

3. Identifying parts of body. The child is shown a large paper doll
and is asked to point to parts of the body. (“Show me the dolly’s hair,”
etc.) Credit is received for correctly pointing out three of the parts.

4. Block building. A box of blocks is placed before the child. The
examiner builds a tower of four blocks and asks the child to do the
same. Credit is received if the child makes a tower of at least four
blocks.

5. Picture vocabulary. The child is shown eighteen cards containing
pictures of common objects. With each picture he is asked, “What
is this?” Credit is given if the child names at least two of the objects.

6. Word combinations. The examiner notes the spontaneous speech
of the child during the test. Credit is given if the child uses at least a
two-word combination such as “see kitty.”


SIX-YEAR LEVEL:


1. Vocabulary. The child is asked the meaning of words from a
graded list of forty-five terms. Credit is given for five correct
definitions. (The same list of words is used throughout all higher age levels.)

2. Differences. The child is asked to tell the difference between
three pairs of words, e.g., a bird and a dog. Credit is given for making
at least two correct responses out of three.

3. Mutilated pictures. The child is shown five pictures in which an
object has a missing part, e.g., a wagon with only three wheels, and
asked to say what is missing in each. Credit is given for getting as
many as four of the problems correct.

4. Number concepts. Twelve blocks are put in front of the child.
He is asked to give different numbers of the blocks to the examiner.
Credit is given for selecting three correct numbers of blocks out of
four trials.

5. Opposite analogies. “A table is made of wood; a window of
____.” Credit is given for at least three out of four correct
responses.

6. Maze tracing. The child is shown three designs each of which
shows two ways for a person to get home. One route is shorter than the
other. The child is asked to trace the shorter route. Credit is given
for two correct responses out of three.




TEN-YEAR LEVEL: 


1. Vocabulary. The child is asked to define words from the stan- 
dard list. Credit is given for eleven or more correct definitions. 

2. Block counting. The student is shown pictures of piles of blocks.
Some of the blocks are directly visible, and others are stacked behind
and beneath. The student is asked how many blocks are in each pile.
Credit is given for correct responses to at least eight of the eleven
pictures.

3. Abstract words. “What do we mean by ____?” e.g.,
“curiosity.” Credit is given for correctly interpreting at least two of the
four words presented.

4. Finding reasons, I. The child is asked to explain why two social
rules are necessary, e.g., “Give two reasons why children should not
be too noisy in school.” Credit is given for supplying two reasons.

5. Word naming. The child is asked to name as many words as he
can in two minutes. Credit is given for twenty-eight words or more.

6. Repeating six digits. Six digits such as 4, 8, 2, 1, 6, 3, are read
at one-second intervals. The child is asked to repeat the digits in their
exact order. Credit is given if one or more complete series out of three
are recalled correctly.


AVERAGE ADULT LEVEL (AGE 15 AND OLDER):


1. Vocabulary. The subject is asked for definitions of words in the
standard list. Credit is given if twenty or more are defined.

2. Ingenuity. Three problems are given. An example is: a boy is
sent to the river to get exactly 3 pints of water, and he has only a
7-pint container and a 4-pint container. How can he measure the water?
Credit is given if two problems are solved.

3. Differences between abstract words. The subject is asked to dis- 
tinguish between three pairs of associated words, e.g., poverty and 
misery. Credit is given if two or more of the distinctions are correct. 

4. Arithmetical reasoning. The subject is asked to solve three prob- 
lems; e.g., “If two pencils cost 5 cents, how many pencils can you buy 
for 50 cents?” Credit is given for two or more correct solutions. 

5. Proverbs, I. The subject is asked to explain the meaning of three 
proverbs, e.g., “A burnt child dreads the fire.” Credit is given for two 
or more correct interpretations. 

6. Orientation. Questions are asked which require the understand- 
ing of compass directions, e.g., “Which direction would you have to 
face so your right hand would be toward the North?” Credit is given 
for correctly answering at least four of the five questions. 

7. Essential differences. “What is the principal difference between
____ and ____?” e.g., work and play. Credit is given for correctly
answering at least two of the three questions.

8. Abstract words. “What do we mean by ____?” e.g.,
“generosity.” Credit is given for correctly interpreting at least four of
the five words presented.

Administration of the Stanford-Binet. The test materials include a
box of performance items (beads, toys, pictures, etc.), test blanks on
which the child’s responses are recorded, and a manual of instructions.
Some of the test materials are shown in Figure 12-1. As was mentioned
previously, the administration of an individual test of this kind requires
a highly trained examiner, and no faith should be placed in the scores
obtained by an amateur tester. The test usually can be given in a
period of fifty to seventy-five minutes.

Figure 12-1. Some of the test materials used with the Stanford-Binet.
(Reproduced by permission of Houghton Mifflin Company.)

No individual is administered all the items. Instead, the child
is started slightly lower in the age scale than he is expected to go.
For example, a typical procedure would be to start a seven-year-old
on the five-year-level questions. The child is then taken up through
all the age levels as far as he can go.

In previous forms of the Stanford-Binet, IQ was obtained by dividing
mental age by chronological age. As was described in Chapter 3, such
quotient scores are fraught with statistical and conceptual difficulties.
Consequently, IQs on Form L-M are determined by a more sensible
procedure. Now IQs are simply transformed standard scores, with a
mean of 100 and a standard deviation of 16 for each age level. The
manual provides tables for rapidly transforming mental age scores
directly into IQ scores.
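
The two ways of arriving at an IQ can be contrasted in a brief Python
sketch. The mean of 100 and standard deviation of 16 are those stated above;
the multiplication of the quotient by 100 is the conventional practice; the
child's mental age and the age-group mean and standard deviation are
hypothetical figures chosen only to make the arithmetic easy to follow. In
actual use the transformation is read from the tables in the manual rather
than computed.

```python
def ratio_iq(mental_age, chronological_age):
    """Older quotient score: mental age over chronological age, times 100."""
    return 100.0 * mental_age / chronological_age


def deviation_iq(score, age_group_mean, age_group_sd, mean=100.0, sd=16.0):
    """Transformed standard score with mean 100 and standard deviation 16.

    `score`, `age_group_mean`, and `age_group_sd` must be in the same
    units (here, mental age in months) and refer to the child's own age group.
    """
    z = (score - age_group_mean) / age_group_sd
    return mean + sd * z


# Hypothetical seven-year-old (84 months) with a mental age of 96 months,
# in an age group whose mental ages average 84 months with an SD of 12.
print(round(ratio_iq(96, 84), 1))
print(deviation_iq(96, 84, 12))
```

Because the standard score is referred to the child's own age group, an IQ
of 116 means the same thing at every age, namely a score one standard
deviation above the mean, which is not true of the quotient.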

Evaluation of the Stanford-Binet. The primary purpose of the Stan- 
ford-Binet is to provide information which helps in making decisions 
about the educational progress of students in elementary and secondary 
school. The evidence is that it serves that purpose very well. Some of 
the salient features of the test are as follows: 


1. Construction. The test was very carefully constructed and stan- 
dardized. A particularly good feature of the test is the care that went 
into providing detailed instructions for scoring each item. 

2. Reliability. Very careful research was undertaken to determine 
the reliability of the Stanford-Binet IQ at different age levels. An 
equivalent-form reliability estimate was made separately for each age. 
Correlations were found between the scores obtained on Form L and 
Form M administered to the same subjects within one week’s time. In 
general, the findings show that the Stanford-Binet is a highly reliable
scale, with most of the reliability coefficients equal to or greater than
.90. Scores tend to be more reliable for persons in their teens than they
are for young children. The studies also show that low scores are
somewhat more reliable than high scores in each age range. In other words,
a bit more faith can be placed in the precision of a very low score
than in a very high score.

3. Predictive efficiency. The test has shown itself to be a good 
predictor of different criteria, particularly of school grades. In general, 
the findings have been that Stanford-Binet IQs correlate in the neigh- 
borhood of .70 with elementary school grades, .60 with high school 
grades, and .50 with college grades. (The decline in validity is prob- 
ably due to the progressively decreasing dispersion of intellectual 
ability.) The following correlations (13) were found between Form
L IQ and high school achievement test scores. The number of students
ranges from 78 to 200.


Reading comprehension     .73     Spelling     .46
Reading speed             .43     History      .59
English usage             .59     Geometry     .48
Literature acquaintance   .60     Biology      .54


4. Testing of adults. On both subjective and empirical grounds
there is reason to believe that the Stanford-Binet is a better test for
children than adults (age 15 and over). One reason for this is that the
concept of general intelligence is more meaningful with children than
adults. This point will be discussed later in the chapter. The
Stanford-Binet does not have a high enough “ceiling” to measure the ability
of superior adults. That is, the range of difficulty at the adult level is
not sufficient to tap the ability of highly gifted individuals. Because
no persons over eighteen years of age were included in the
standardization sample, the norms for adults are suspect.

5. Clinical utility. Presumably, a test like the Stanford-Binet
should be judged, in the long run, by the success with which it
predicts different criteria. Many of the uses, however, to which this and
other general intelligence tests are put are so subtle as to make direct
empirical validation difficult. For example, the test might be used to
decide what type of psychotherapy should be used with a disturbed
child. Because of the difficulty in measuring therapeutic success and
the difficulty in deciding the importance of “intelligence” for the
outcome, it is hard to determine how well the test works in the situation.
In many situations only the clinical impression of how well the test
performs a particular job can at present be used as an indication of
“validity.” Judging from the wide acceptance of the test in clinical
settings, it is apparent that the Stanford-Binet is judged to be as
valuable as, or more valuable than, any other test for use with children.
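
The remark under predictive efficiency, that validity declines as the
dispersion of ability shrinks, can be illustrated with a short Python
sketch. It uses the standard formula for the effect of direct restriction
of range on a correlation, which assumes a linear relation between test and
criterion with equal scatter at all ability levels. The starting correlation
of .70 is the elementary school figure cited above; the ratios of restricted
to unrestricted standard deviation are hypothetical and are not estimates
for any actual high school or college population.

```python
import math


def correlation_after_restriction(r_full, sd_ratio):
    """Correlation expected in a group selected directly on the test.

    `r_full` is the correlation in the unrestricted group and `sd_ratio`
    is the restricted SD of test scores divided by the unrestricted SD.
    """
    numerator = r_full * sd_ratio
    denominator = math.sqrt(1.0 - r_full ** 2 + (r_full ** 2) * (sd_ratio ** 2))
    return numerator / denominator


# Hypothetical shrinkage in the spread of ability at higher school levels.
for sd_ratio in (1.0, 0.8, 0.6):
    r = correlation_after_restriction(0.70, sd_ratio)
    print(f"SD ratio {sd_ratio:.1f}: r = {r:.2f}")
```

The point of the sketch is only that a test can predict less well in a more
select group without any change in what it measures.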


The Wechsler Scales 

The two Wechsler scales are competitive with the Stanford-Binet as
individually administered tests of general ability. Unlike the
Stanford-Binet, there are separate Wechsler scales for adults and children,
which will be described in turn.


WAIS. Wechsler began his work on the measurement of general
ability with the development of an adult test. The test frequently is
used with students aged fifteen and older. The presently used form is
called the Wechsler Adult Intelligence Scale (84), which was
published in 1955. The adult scale was intended to differ from the
Stanford-Binet in the following respects:


1. The test items were to be more appropriate for adults, and
representative norms for adults were to be obtained.
2. Age levels were to be discarded in favor of a number of subtests
which all subjects would take.
3. Separate sets of verbal and performance tests were to be
constructed allowing for both a verbal and performance IQ.
4. The IQ was to be determined by a transformation of standard
scores rather than through the use of mental age scores.


The WAIS consists of eleven subtests, which are described as follows:


258 


Prediction and Trait Measurement: Human Abilities 


VERBAL SCALE: 


1. General information. The subject is asked twenty-five questions 
concerning a wide variety of facts. The questions are not intended to 
tap academic training or specialized branches of knowledge. They are 
meant to cover the kinds of information that any alert individual can 
learn from his cultural contacts.

2. General comprehension. The test contains ten items concerning
why certain social rules are necessary and how everyday problems are
solved.

3. Arithmetical reasoning. Ten problems of the kind that would
be typically encountered in elementary school arithmetic are given.
Both speed and correctness of response are scored.

4. Similarities. The subject is asked to tell what is similar about 
twelve pairs of terms. This subtest is very similar to material found on 
the Stanford-Binet. 

5. Digit span. This is the familiar memory for digits which also
appears at different levels of the Stanford-Binet. From three to nine
digits are read to the subject, and he is asked to repeat them in their
exact order. In the second part of the test, the subject is asked to repeat
the digit series backward.

6. Vocabulary. Forty words of increasing difficulty are presented. 
The subject is asked what each word means. 


PERFORMANCE SCALE: 


7. Digit symbol. This is an adaptation of the familiar coding test. 
The subject is given a sheet of paper on which nine symbols are paired 
with nine numbers. Farther down on the page a jumbled list of the 
numbers is given, and the subject is asked to write in the matching 
symbols.

8. Picture completion. The subject is shown fifteen incomplete
pictures and asked to describe the missing part in each. This is also
very much like material found on the Stanford-Binet.

9. Block design. The subject is shown a set of small blocks. Surfaces
of the blocks are painted white, red, and red and white. The subject
is presented with a picture of a design and asked to reproduce it with
the blocks. Seven designs are given in turn. Both speed and accuracy
are scored.

10. Picture arrangement. The subject is handed a set of pictures 
and asked to arrange them in an order that tells a story. Six sets of 
pictures are given. Both speed and accuracy are scored. 

11. Object assembly. The subject is asked to put together three
jigsaw puzzles. Each puzzle pictures some part of the human body.
Both speed and accuracy are scored.




WISC. A separate scale is available for children, which is called 
the Wechsler Intelligence Scale for Children (83). The WISC is used 
for students aged five to fifteen, and the WAIS is used for all older age 
groups. The WISC is very similar to the WAIS, the major difference 
being that the WISC contains material more appropriate for, and more 
interesting to, younger people. The subtests of the WISC are as follows: 


Verbal scale                    Performance scale

1. General information          6. Picture completion
2. General comprehension        7. Picture arrangement
3. Arithmetic                   8. Block design
4. Similarities                 9. Object assembly
5. Vocabulary                  10. Coding (or mazes)
Alternate: digit span


On the verbal scale, digit span is given as an alternate if, for some
reason, one of the other tests is not usable. On the performance scale
(see Figure 12-2) the examiner has the choice of using either coding
or mazes. The coding test on the WISC is similar to the digit symbol
test on the WAIS. The mazes test is the only one that does not appear
on the adult form. It consists of eight paper-and-pencil mazes of
increasing difficulty, performance being scored in terms of both time
and number of errors.

Figure 12-2. Child being administered performance materials from the
Wechsler Intelligence Scale for Children. (Reproduced by permission of
The Psychological Corporation.)

IQs on the WISC are determined in the same general manner as on
the adult test. As in the adult form, IQs can be obtained separately for
total scale, verbal scale, and performance scale. All IQs are simply
transformed standard scores with a mean of 100 and a standard
deviation of 15.

The manual (83) states the percentage of children that fall at
different IQ levels, with a verbal description of what particular scores
mean (see Table 12-1).


Table 12-1. Classification of IQs on the WISC (Reproduced by
permission. Copyright, 1949, The Psychological Corporation, New York,
N.Y. All rights reserved.)

Description         IQ ranges        Per cent of children

Very superior       130 and above          2.2
Superior            120-129                6.7
Bright              110-119               16.1
Average             90-109                50.0
Dull normal         80-89                 16.1
Borderline          70-79                  6.7
Mental defective    69 and below           2.2
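
The statement that IQs are "transformed standard scores with a mean of 100 and a standard deviation of 15" can be illustrated with a short sketch. It is a simplification, not the publisher's scoring procedure: the function names, the single raw-score conversion, and the assumption that IQs follow a normal curve are all supplied here for illustration. Under those assumptions the share of children expected in each band comes out close to, though not identical with, the percentages in Table 12-1.

```python
from statistics import NormalDist

IQ_MEAN, IQ_SD = 100, 15  # scale stated in the text


def deviation_iq(raw_score, age_group_mean, age_group_sd):
    """Convert a raw score to an IQ by way of a standard (z) score.

    The age-group mean and standard deviation are hypothetical inputs;
    a real test manual supplies conversion tables instead.
    """
    z = (raw_score - age_group_mean) / age_group_sd
    return IQ_MEAN + IQ_SD * z


def band_shares():
    """Proportion of a normal IQ distribution falling in each band.

    Bands are treated as continuous intervals cut at 70, 80, 90, 110,
    120, and 130 (an assumption; the table lists whole-number ranges).
    """
    dist = NormalDist(IQ_MEAN, IQ_SD)
    cuts = [70, 80, 90, 110, 120, 130]
    labels = ["Mental defective", "Borderline", "Dull normal", "Average",
              "Bright", "Superior", "Very superior"]
    cumulative = [0.0] + [dist.cdf(c) for c in cuts] + [1.0]
    return {label: cumulative[i + 1] - cumulative[i]
            for i, label in enumerate(labels)}


shares = band_shares()
for label, share in shares.items():
    print(f"{label:18s}{100 * share:5.1f}")
```

A child whose raw score lies one standard deviation above the mean of his age group would receive an IQ of 115 under this conversion, whatever his age. That is the sense in which the deviation IQ dispenses with mental age.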


Evaluation of the Wechsler Scales. Most of the good things that
can be said about the Stanford-Binet apply equally well to the
Wechsler scales. In fact the two types of tests correlate so highly at
most age levels that it is illogical to argue which is “better” in general.
Although there are some advantages of one type of test over the other
for particular purposes, the choice of whether to use one rather than
the other type of scale often boils down to the personal preferences
of the test user. Some of the salient features of the Wechsler scales are
as follows:


1. Administration. Although, like the Stanford-Binet, the Wechsler
scales are individually administered, they are somewhat easier to
administer than the Stanford-Binet. The full WAIS or WISC usually
can be administered in no more than one hour.

2. Standardization. The manual provides relatively detailed
instructions for scoring each test. The norms were based on
representative samples of children and adults. The care that went into
the construction of norms is illustrated by the normative samples used
for the WISC. One hundred boys and one hundred girls were tested at
each age level, giving a total of 2,200 children in the standardization
sample. A strenuous effort was made to choose a representative cross
section of white children in the United States. The sample was drawn
from eighty-five communities in eleven states. The distribution of
subjects closely resembled the country at large in terms of urban-rural
proportion, geographical area, and parental occupation. The WISC
standardization sample is as representative as that used in almost any
current measure of general ability.

3. Reliability. The Wechsler scales are highly reliable; that is, the
pure chance factors influencing test results are quite small. Because
no alternate forms are available, split-half reliability estimates were
used at various age levels. The results at various ages indicate an
overall reliability for the total scale of about .95, which, to say the
least, is quite good.

4. Performance scale. The performance items on the Wechsler tests
are an advantage over the Stanford-Binet for certain purposes.
For most students, the more “verbal” types of items, such as appear
throughout the Stanford-Binet and on the verbal scales of the Wechsler
tests, are more predictive of achievement in school. However, for
certain types of “unusual” children, the performance items on the
Wechsler scale are quite helpful. They are helpful for children with
various types of language problems, with the deaf, and with children
who have been reared in other countries. They also are useful with
children who have led “impoverished” lives of a kind that would
markedly hinder their performance on more verbal tests. Although
the verbal and performance scales correlate highly, when a child shows
very different scores on the two types of scales, it is of real diagnostic
importance.

5. Testing adults. Most will agree that the WAIS is better than the
Stanford-Binet for testing most adults, particularly those adults who
are well above average in general ability. The WAIS has items which
are more appropriate for adults and more interesting to them, and the
test has a higher “ceiling” than the Stanford-Binet. Because the WISC
does not extend below the five-year level, the Stanford-Binet dominates
the testing of preschool children. Between the ages of five and fifteen,
it is hard to choose between the two tests.

6. Predictive efficiency. Although not as much research has been
done with the Wechsler scales to predict school achievement and
vocational success, the available evidence indicates that the scales will
do approximately as well as the Stanford-Binet. Respectable
correlations have been found with both grades in school and indices of
vocational accomplishment.




7. Clinical utility. As is true of the Stanford-Binet, many of the 
uses of the Wechsler by educational and psychological specialists are 
so subtle as to make direct appraisal quite difficult, and it is necessary 
to rely on impressions of how well the tests work. Evidently, the 
Wechsler scales are thought to be very useful in making decisions 
about problem students.
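
The split-half estimate mentioned under reliability (point 3 above) can be sketched in a few lines. The text does not say how the half-test correlation was converted into a full-scale figure; the Spearman-Brown step-up used below is the customary correction and is an assumption here, as are the function names and the made-up item scores. The sketch scores the odd-numbered and even-numbered items separately, correlates the two half scores, and then estimates the reliability of the full-length test.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """item_scores: one row per person, one column per item (1 = pass)."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))


# Hypothetical results for six persons on a six-item test.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
reliability = split_half_reliability(scores)
print(round(reliability, 2))
```

Read in the other direction, the same formula implies that a full-scale reliability of about .95 corresponds to a correlation of roughly .90 between the two halves.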


Group Tests of General Ability 


Like the individual tests of general intelligence, the group tests are
usually composed of verbal comprehension, numerical computation,
and various mixtures of the reasoning factors. Although the tests differ
from one another in appearance and sometimes in their factor
composition, they tend to correlate highly with one another. At the
teen-age and adult levels the group tests correlate highly with the
individual tests, such as the WAIS. We see the interesting result that
people who started off in seemingly different directions to compose
intelligence tests ended up with rather similar measures. Where the
group tests differ from one another is in their practical advantages.
Some are longer and thus more reliable than others. Some have obtained
norms in a careful and representative manner; others have only scant or
misleading information on norms. Some have either higher or lower
“ceilings,” making them more useful with one or the other extreme of
ability. Considerable research has been done with some of the tests,
and only “face validity” can be claimed for others.

The Binet and Wechsler scales dominate the field of individual tests 
of general intelligence. Among group measures neither one nor several 
tests have dominated the field. The tests, therefore, which will be 
discussed here are only examples of the measures available.

Group Tests for Young Children. The youngest ages at which it
has proved feasible to use group tests are the five- and six-year levels.
Only small groups of approximately a dozen children can be tested in
this way, and even then the examiner must exercise considerable skill
to obtain the necessary cooperation and attention. Tests at this age
level cannot employ written language, and the child cannot be expected
to write his own responses. Test instructions must be given orally and
supported by illustrations and gestures.

One of the most widely used tests for young children is the Pintner- 
Cunningham Primary Test (61) which has been in use for over twenty 
years. The test is available in three equivalent forms, A, B, and C. Each 
form is composed of seven subtests which are added together to obtain 
one score. (Illustrative items are shown in Figure 12-3.)

Figure 12-3. Illustrative items from Form A of the Pintner-Cunningham
Primary Test. (Reproduced by permission of Harcourt, Brace, & World,
Inc.) Item instructions shown: Test 2, "Mark the prettiest girl"; Test 7,
"Look at how each picture is drawn; make another one like it in the
dots."

Equivalent-form reliabilities are found to be generally high for
groups of kindergarten and first-grade children, ranging from .83 to .89.
Correlations between the Pintner-Cunningham and the Stanford-Binet
are usually about .80. In a group of 260 first-grade children, the
Pintner-Cunningham correlated .63 with scores on a reading test.

Group Tests for the Elementary School Level. As children progress
through the elementary school levels, more and more written material
can be employed in tests. In the first several grades it still is necessary
to rely heavily on oral instructions and pictorial test materials. One of
the best available tests for the elementary grades is the
Lorge-Thorndike Intelligence Tests (52), which has tests at five levels
ranging from those appropriate to kindergarten up to the twelfth grade.
At all levels the tests emphasize verbal comprehension, number skills,
and reasoning abilities. (Illustrative items are shown in Figure 12-4.)
The test can be given in about thirty minutes. A particularly good
feature of the tests is that up through grade 3 they are “nonverbal,”
all the items using pictures and designs. This is good because it
provides a relatively pure measure of “abstract” ability which is not
highly dependent on the fortuitous circumstances that influence the
early acquisition of reading skills. Alternate forms are available at each
age level. The manual of instructions presents simplified procedures
that permit any conscientious teacher to administer the tests in the
classroom.

Figure 12-4. Sample items from the Lorge-Thorndike Intelligence Tests,
kindergarten and first-grade level. (Reproduced by permission of
Houghton Mifflin Company.) Teacher says, "Circle the one showing the
boy diving." "Circle the one showing a girl eating." "Circle the empty
box."

Even though the tests take only a short time to administer, they are
highly reliable. Alternative-form and split-half correlations are
respectably high. The tests correlate well with the Stanford-Binet, with
some other group tests of general ability, and with some achievement
tests.


General Intelligence Tests for Infants and Preschool Children 


A separate section has been reserved here for the discussion of infant
and preschool tests because of the special problems that are involved.
One of the pioneers in the testing of infants was Arnold Gesell. For
over twenty years he and his colleagues performed longitudinal studies
of child development. A group of 107 infants was systematically
observed at four, six, and eight weeks and at every four-week interval up
to fifty-six weeks. The children were studied again at eighteen months
and at the ages of two, three, four, five, and six years. On the basis of
these observations the Gesell Developmental Schedules were prepared
(29). They are intended to measure the following attributes:


1. Motor behavior. How well the child can hold his balance, co- 
ordinate, stand, walk, and manipulate objects. 

2. Adaptive behavior. How well the child can solve the problems 
of his small world: obtain objects, remove obstacles, solve puzzles, and 
react to stimuli. 

3. Language behavior. How well the child can communicate, using 
the word in its broadest meaning, including the use of gestures and 
primitive words, to the later development of real language. 

4. Personal-social behavior. How well the child learns habits of 
personal care such as toilet training, dressing, and feeding himself. At 
a later age consideration is given to how the child manages himself 
in social situations and in play activity. 


During the first year of life, when the Gesell scales would supposedly 
have their unique value, most of the observations have to be made 
about motor behavior. The four-week-old infant cannot, of course, talk 
or follow oral instructions of any kind. The most that can be done at 
the infant stage is to watch the child’s spontaneous movements and 
note how he reacts to various stimuli. At 1.4 months the average child 
can coordinate his eyes on an object held before him. At 3 months he 
will make reaching movements for an object. At 5.5 months he will 
react differently to strangers than to his parents. (See Figure 12-5 for 
illustrative test materials.) Other tests for infants are the Cattell Infant 
Intelligence Scale (17), the California First-Year Mental Scale (3), 
and the Northwestern Infant Intelligence Tests (32). 

Analysis of Infant Tests. Tests for infants are difficult to
standardize, administer, and score. They are, of course, all individual
measures. Infant tests are less reliable than tests for older children. The
reliability is considerably lower during the first six months than
afterward. Several studies of different tests have found reliabilities
around .65 for testing during the first six months. After six months the
reliabilities move up to respectable figures in the .80 to .90 range.
Except for the first weeks and months, the infant tests do measure
something consistently. The question is, what do they measure? A real
difficulty in validating infant scales is that there are almost no criteria
available until the child enters school. A customary procedure has been
to correlate infant tests with scores made several years later on more
established intelligence tests like the Stanford-Binet. Studies have
shown that infant tests given at the age of one year or less correlate
about zero with intelligence tests given five, ten, and fifteen years
later to the same persons. It is obvious that infant tests do not measure
intelligence as it is customarily measured in older children. The key to
this dilemma seems to be that the infant scales primarily measure
motor and sensory abilities, and the research on reliability shows that
there is some consistency in the development of these attributes. It is
quite likely that the infant tests would predict motor and sensory skills
later in life, but interestingly enough, almost nothing has been done
to test this hypothesis.

Figure 12-5. Test materials used with the Gesell Developmental Schedules.
(Courtesy of The Psychological Corporation.)

Preschool Tests. Between the ages of two and five the developing
intellectual processes become accessible to psychological tests. After
the child develops speech, can manipulate objects, and becomes
acquainted with the world about him, he can be tested with some of
the materials that are customarily used in intelligence tests. However,
many of the difficulties in test standardization and administration still
remain. The test materials must be largely pictorial or consist of
performance problems.

The difficulty in testing infants is that they usually do little one way
or the other to indicate their intellectual abilities. Children between
the ages of two and five do too much. They are so active and
distractible that it is difficult to carry on any formal testing procedures.
Many children in this age group are shy with strangers and will give
little if any cooperation to the examiner. They often are not highly
motivated to impress the examiner or themselves with how well they
can perform. Consequently, the test must be posed as an interesting
game to the child, and much depends on the examiner's skill.

One of the most prominent tests for young children is the Minnesota
Preschool Scale (33). There are two equivalent forms, each with
twenty-six items. Some of the items are as follows (see Figure 12-6
for an illustration of some of the testing materials):

1. Pointing to parts of the body on a doll
2. Telling what a picture is about
3. Naming colors
4. Digit span
5. Naming objects from memory
6. Vocabulary
7. Copying simple geometrical designs
8. Block building
9. Jigsaw puzzle
10. Indicating missing part in pictures

Figure 12-6. Test materials used in the Minnesota Preschool Scale.
(Reproduced by permission of Educational Test Bureau, Minneapolis.)


Many of the items are similar to those at the lower age levels of the 
Stanford-Binet. The instrument is largely a power test with no 
emphasis on speed, and the items are little concerned with motor skills. 
Tests at this age level which depend on speed and motor skills are 
probably poorer measures. 

The Minnesota scale was standardized on a group of 900 children
ranging in age from 1½ to 6 years. Equivalent-form reliabilities of the
total scale vary from .80 to .94. There are some reasons to believe that
the Minnesota scale is not an entirely adequate measure below the age
of three. Although scores for children above three tend to correlate
highly with Stanford-Binet scores obtained later, the correlation for
children below the age of three is only .21. Also, clinical experience
indicates that some of the test materials are not sufficiently interesting
to hold the attention of children below the age of three. Two other
preschool tests are the Intelligence Test for Young Children (80) and
the Merrill Palmer Scale (74).


The Nature of “General Intelligence” 


It has been shown that in spite of the separable factors that underlie 
tests of human ability there is a common ground among ability func- 
tions that can be reliably and usefully measured. The “verbal” intel- 
ligence tests, both individual and group, generally correlate highly 
enough with one another for us to speak of their common charac- 
teristics. If a marked difference in test scores is found between two 
kinds of people with one of the tests, the difference would most likely 
be reflected in the others. The wide use of intelligence tests during the
last half century has shown a number of things about the underlying
process:


1. Intelligence cannot at present be measured in children below the
age of two years and not very well below the age of five or six years.
Earlier in the chapter the difficulties of constructing tests for infants
and preschool children were discussed. Perhaps the next decade will
show an improvement in the early measurement of general ability.

2. The concept of “general intelligence” is more meaningful with
children than adults. Although the evidence on this point is somewhat
conflicting, it seems that abilities are more “general” in children.
Comparative factor analyses of children and adults tend to show that
there is more of a tendency toward one general factor in children. Some
of the factors which are found in adult populations are difficult to find
at all in children. This may be due in part to the fact that different test
materials have to be used with children than with adults. However, the
weight of the argument is that abilities are more “general” in children.
The factorial diversity of abilities in adults is probably due to
different life experiences and different kinds of school and vocational
training. It makes more sense to use general intelligence tests with
children than adults.

3. Intelligence as measured by current “verbal” tests is partly due
to heredity and partly due to environment. The most telling arguments
for this position come from the studies of resemblance between family
members in intelligence. Conrad and Jones (18) administered
intelligence tests to over two hundred families in rural New England.
They found that for children above the age of five years, intelligence
of parents and children correlates .49. Numerous other studies have
found correlations very close to .50. Conrad and Jones also found a
correlation of .49 between siblings (between brothers, between sisters,
or between brothers and sisters). Roberts (66) pointed out that a
correlation of .50 between siblings is what would be expected from
multifactor inheritance. It is possible that the correlations which have been
found between family members could be due to environment rather 
than heredity. Family members tend to share the same environment, 
talk about topics in common, and have similar kinds of schooling. 
Environment may explain part of the resemblance, but studies of twins 
make it apparent that this is not a complete explanation. Correlations 
between the intelligence test scores of fraternal twins (dizygotic) are 
usually higher than between siblings, ordinarily ranging from .50 to .70. 
Correlations between identical twins (monozygotic) are usually 
around .90—almost as high as the reliability of the tests! Fraternal 
twins can have very different genetic structures but the genetic struc- 
tures of identical twins are exactly alike. This leaves little doubt that 
at least a portion, and apparently a sizable portion, of intelligence is 
due to inheritance. 

4. After the age of about six years, the individual's intelligence tends
to remain stable with respect to his age group. That is, the superior
people at one age level tend to be the superior people at other age
levels (see Figure 12-7 for evidence on this point). The relationship
is far from perfect, and isolated individuals may show drastic changes
over a period of years.

5. There are definite group differences in intelligence test scores.
Lower scores on the average are made by people of low socioeconomic
status, people living in rural areas, people living in the Southern or
Southwestern part of the United States, immigrants from southern
Europe, and Indians and Negroes. The interpretation of differences of
this kind places a large strain on the logical foundation of intelligence
tests. As was discussed previously in this chapter, the traditional
measures of intelligence are constructed in such a way as to favor
certain groups. The question is whether or not the subgroups which
tend to make lower scores would make high scores if afforded the
advantages of the wider culture. There is some evidence to show that
they would. It was found (46) that Southern Negro children who
migrate to New York make higher scores the longer they are in the
city environment. Another point that should be considered is that
even though some subgroups score lower on the average than others,
at least some high-scoring individuals can be found in each. The
whole question of ethnic and socioeconomic differences is highly
charged with emotion in both professional and lay circles and is a
point about which much more information needs to be obtained.


Figure 12-7. Effect of age at initial testing and test-retest interval
on prediction of later Stanford-Binet IQ from earlier test. (Adapted
from Honzik, McFarlane, and Allen, 43.) [Plot not reproduced:
horizontal axis, age at later test (6 to 14-15); vertical axis scaled
from .00 to 1.00; curves labeled “8,9 yr tests” and “10 yr tests.”]


6. Many different kinds of attainment involve intelligence as
measured by traditional “verbal” instruments. In particular, intelligence
is one of the major factors in successful schoolwork. Numerous
correlations of .50 and above between particular tests and school grades
were cited in this chapter. Intelligence test scores differentiate
occupational groups (see Table 12-2). However, it should be noted that
there is considerable overlap among most groups. Comparing the extremes
in Table 12-2, the top 10 per cent of lumberjacks score higher than the
lower 10 per cent of accountants. Scores on intelligence tests are
predictive of success on many but not all jobs (see Table 12-3). The fact
that the correlations between test scores and job success are near zero
in some cases does not necessarily mean that intelligence is not
important for the job. The dispersion of intellectual ability usually is
narrowed considerably in most jobs by each individual's gravitating
toward a job at which he can work comfortably and by the selection
procedures that are used in industrial settings. Also, it is important to
note that the variability of intelligence test scores is higher for
lower-level than for higher-level jobs. This indicates that, as would be
expected, intelligence is a more important determiner of success in
high-level occupations. The individual's ability to succeed is determined
by his intelligence and by a host of other things as well: abilities not
measured by intelligence tests, interests, personality traits, and just
plain luck.


Table 12-2. AGCT Standard Scores of Occupational Groups in the Second
World War (Adapted from N. Stewart, 69)

                                      Percentile
Occupational groups        10     25     50     75     90

Accountant                114    121    129    136    143
Teacher                   110    117    124    132    140
Lawyer                    112    118    124    132    141
Bookkeeper, general       108    114    122    129    138
Chief clerk               107    114    122    131    141
Draftsman                  99    109    120    127    137
Postal clerk              100    109    119    126    136
Clerk, general             97    108    117    125    133
Radio repairman            97    108    117    125    136
Salesman                   94    107    115    125    133
Store manager              91    104    115    124    133
Toolmaker                  92    101    112    123    129
Stock clerk                85     99    110    120    127
Machinist                  86     99    110    120    127
Policeman                  86     96    109    118    128
Electrician                83     96    109    118    124
Meatcutter                 80     94    108    117    126
Sheet metalworker           2     95    107    117    126
Machine operator           77     89    103    114    123
Automobile mechanic        75     89    102    114    122
Carpenter, general         73     86    101    113    123
Baker                      69     83     99    113    123
Truck driver, heavy        71     83     98    111    120
Cook                       67     79     96    111    120
Laborer                    65     76     93    108    119
Barber                     66     79     93    109    120
Miner                      67     75     87    103    119
Farm worker                61     70     86    103    115
Lumberjack                 60     70     85    100    116


Table 12-3. Median Validity Coefficients of Intelligence Tests for
Various Occupational Groups in the Prediction of Job Proficiency
(Adapted from E. E. Ghiselli and C. W. Brown, 31, p. 577)

Occupational            Median validity     Number of
group                   coefficient         validity coefficients

Clerical workers             .35                 85
Supervisors                  .40                  9
Salesmen                     .33                  4
Sales clerks                -.09                 18
Protective service           .25                  6
Skilled workers              .55                  6
Semiskilled workers          .20                 45
Unskilled workers            .08                 13
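
The two observations drawn from Table 12-2 in point 6, overlap between extreme groups and wider spread in lower-level jobs, can be checked directly from the tabled percentiles. The sketch below copies the 10th, 50th, and 90th percentile scores for four occupations from the table; the function names and the use of the 10th-to-90th percentile distance as a rough index of variability are choices made here for illustration.

```python
# (10th, 50th, 90th) percentile AGCT scores copied from Table 12-2.
agct = {
    "Accountant": (114, 129, 143),
    "Teacher": (110, 124, 140),
    "Farm worker": (61, 86, 115),
    "Lumberjack": (60, 85, 116),
}


def spread(occupation):
    """Distance from the 10th to the 90th percentile, a rough index
    of how variable scores are within the occupation."""
    p10, _, p90 = agct[occupation]
    return p90 - p10


def tails_overlap(lower_group, higher_group):
    """True if the 90th percentile of the lower-scoring group exceeds
    the 10th percentile of the higher-scoring group."""
    return agct[lower_group][2] > agct[higher_group][0]


print(tails_overlap("Lumberjack", "Accountant"))
for occupation in agct:
    print(f"{occupation:12s} spread = {spread(occupation)}")
```

The lumberjacks' 90th percentile (116) lies above the accountants' 10th percentile (114), which is the overlap noted in the text, and the spread for lumberjacks is nearly twice that for accountants.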


Using Tests of General Ability in Schools 


At a number of places so far in this book it has been emphasized 
that tests are useful only to the extent that they help in making deci- 
sions. Tests of general ability are potentially useful in helping to make 
decisions about (a) grade placement, (b) ability grouping within 
grades, (c) special instruction, (d) counseling, (e) vocational guid- 
ance, and (f) planning for higher education. Some of the important 
principles for using tests of general ability in these ways are described 
as follows:

Choice of Tests. In this chapter numerous different types of tests
have been described. Which ones should be used for helping to make
particular kinds of decisions? There are some rules that partially
answer the question.


Most schools employ standardized tests of achievement. As has been 
mentioned in a number of places in the book, tests of general ability 
tend to correlate highly with achievement tests. Consequently, tests of 
general ability should be used only if they add something to what can 
be obtained from achievement tests alone. Tests of general ability (at 
least the good ones) definitely add something with children from five 
to about ten. In those years, the child is still getting used to school, 
and he has not had enough “book learning” for achievement tests to 
accurately mirror what he may accomplish later. Because most tests of 
general ability, particularly at the earlier age levels, are relatively more 
concerned with “abstract” ability, they are more prognostic of later 
achievement than are achievement tests given in the primary grades. 




Because the “abstractness” of many measures of general ability is 
an asset in dealing with children from five to ten, the more abstract the 
test, the better. At those age levels, the Stanford-Binet and the WISC 
are semi-independent of school learning, which is why they are highly 
recommended. The fact that the Lorge-Thorndike group test (de- 
scribed earlier in this chapter) is “nonverbal” up to the third-grade 
level is considered an advantage. It can be expected that, in order to 
add something to what can be obtained from routinely administered 
achievement tests, tests of general ability will become even more 
“abstract.” One step in this direction is to employ more items concern- 
ing figural relations such as those illustrated early in this chapter. 

The choices of whether to use group or individual tests in general, 
and when to use one rather than the other, are relatively easy to make. 
The individual tests simply are far too expensive to be routinely given to 
all children. Consequently, most schools will have to rely on one of the 
group tests of general ability for routine testing of students. The 
individual tests are necessary, and worth the expense, when the child 
constitutes a problem in some sense. 

For several reasons, tests of general ability are more valuable in the 
elementary grades than in high school. One reason is that, as was 
argued previously, the abilities of children are more “general”; there- 
fore, one general measure will tell much about the child. Beginning 
with the teens it is advisable to use one of the multifactor batteries 
when possible. Another reason is that much of the need for a test of 
general ability is over by the time the student is well along in high 
school. The major value of such tests is in getting students off to a 
good start in school, and by the time the student is in high school, 
much of the “good” or “evil” has already been done. A third reason is 
that tests of general ability for teen-agers and average adults (in 
contrast to some of those for young children) are so saturated with 
“book learning” that they closely resemble tests of achievement and 
tend to add only a small amount of additional information. At the 
early age levels, the school needs both measures of general ability and 
achievement. In the later years of high school, the need is for either 
a measure of general ability or a measure of achievement. Although 
even at the upper age levels most tests of general ability still add some- 
thing to routinely employed measures of achievement, it is a real ques- 
tion whether most schools can pay the price for the relatively small 
amount of additional information obtained. 

Counseling and Guidance. Tests of general ability would be worth- 
while if for no other reason than the important part they play with 
problem children. In a typical problem the first-grade child is over- 
active, runs around the room rather than working on exercises, will not 
follow instructions, and is apparently incapable of doing first-grade 




work. The teacher suffers it for a month and then tells the principal that 
something has to be done. The principal contacts a school psychologist, 
who administers the Stanford-Binet as one part of the clinical examination. 
What the test shows has an important bearing on the child. If his 
score is very high, he may be restless out of boredom and need more 
challenging fare. If he is near average, he may have emotional prob- 
lems centered in the home, or he may be poorly disciplined. If his 
score is very low, he may be totally unable to master the first grade 
and be causing trouble because he is frustrated and angry. Regardless 
of what the tests show, they supply one important type of information 
to help in making decisions about problem children. 

Individually administered tests are particularly helpful in dealing 
with problem children. Many children who are too disturbed to per- 
form well in class or on achievement tests often show high ability when 
carefully drawn out by the expert examiner. Also the individual testing 
situation provides the examiner with many opportunities to observe 
the emotional behavior of the child, his habits, and his methods of 
approach to problem-solving situations. 

Curriculum Management. Some school systems use tests of general 
ability to decide when children may be admitted to the first grade. If 
children score high enough, they can be admitted when they are only 
5½ or even younger. Although the practice is controversial, the good 
probably outweighs the potential dangers involved. Children of the 
same chronological age are not of the same mental age. Basing grade 
placement on chronological age is, if anything, less logical than basing 
it on mental age obtained from a good test. 

An even more controversial practice is to place children in ability 
groups within particular grades. There are both very good and very 
bad things to be said about the practice. On the bad side the practice 
potentially can (a) make the slow learners feel inferior, (b) place an 
unhealthy emphasis on intelligence, and (c) deprive all students of 
the opportunity to learn about other children in general rather than 
about their “own kind.” On the good side, ability grouping 
can (a) let every student learn at his own pace without either dragging 
others along or being dragged along by the others, (b) let slow learners 
work with students of their own level of ability, and (c) simplify the 
teacher's work by giving him students who can do, and are interested 
in, the same materials. No attempt will be made here to reach a 
decision as to whether the potentially good features of ability grouping 
outweigh the potentially bad ones; but, should ability groups be used, 
tests of general ability are very helpful. Up to about the fourth or 
fifth grade, tests of general ability offer the best measures currently 
available for grouping children in terms of ability. After that point, 
standardized achievement tests do as well or better. 




Classroom Instruction. To the child, “school” means the day-to-day 
interactions with his teacher and with fellow students. How can, and 
should, tests of general ability interact with that all-important (for the 
child) microcosmos? Some argue that teachers and students would be 
better off if IQ tests had not been invented, and there is some truth in 
the arguments. On the negative side, teachers often display a naive 
faith in the tests. They forget that the tests are man-made and only as 
good as men know how to make them at the present time. 

Teachers must be careful not to let the IQ become an index of value 
or moral goodness. Good character, sportsmanship, pleasantness of 
personality, and most other desirable personal attributes are not 
related to measures of general ability. If the teacher wants to single 
out a student who is in need of special attention and consideration, 
it is the child with a low IQ. School will always be an uphill fight for 
him, and he will have to settle for less in life than his brighter school- 
mates can expect. If the teacher has any energy left from trying to love 
them all, that last ounce of affection and concern should go to the 
child who needs it most. 

Tests of general ability give the teacher an approximate idea of what 
to expect from each child. If the child has a very high score, he can be 
expected to perform well beyond the average. If he has a very low 
score, he may need special instruction. However, teachers must re- 
member that general measures are just that: they do not indicate the 
particular ways in which the child is more and less bright. If the child 
makes a very high score, he probably can perform well in most school 
topics. If he makes a very low score, he probably will have trouble in 
most school topics. However, most children are not at the extremes, 
and for them tests of general ability leave many unanswered questions. 
Although all abilities tend to go together, there are definite exceptions. 
Consequently, one child may show an average IQ but be high in 
mathematical ability and low in other respects, and many other pat- 
terns of ability are possible. Teachers should look for the particular 
abilities of the child rather than rely solely on one index of general 
ability. 

Tests of general ability do not necessarily measure creative poten- 
tial. Although students who make low scores seldom become creative 
adults, there is far from a one-to-one correspondence between IQ and 
creativity for above-average students. Some students with very high 
IQs fail to be creative adults, and some students with only modest 
IQs produce truly creative works. (Chapter 14 will 
discuss measures of creativity.) Tests of general ability are primarily 
useful for predicting how well children will do in school. 

In using tests of general ability to predict success in school, teachers 
must remember that such tests are primarily useful for predicting and 




making decisions about the next step. A test of general ability given to 
a five-year-old is highly predictive of how well he will perform in the 
first several grades. A test of general ability given at the beginning of 
junior high is quite predictive of progress during the ensuing several 
years. However, tests of general ability are not highly predictive of 
performance years hence. For example, the correlation between meas- 
ures of general ability given to five-year-olds and successful perform- 
ance in college is nil. 

Although the mental abilities of children change quite slowly, they 
do change. Mental abilities of children are not highly “crystallized.” 
The child who appears only average at age six may appear superior at 
age eighteen. The child who appears above average at seven may ap- 
pear below average as an adult. Children grow at different rates, both 
physically and mentally, and it is no more sensible to expect a test of 
general ability given at age six to be an infallible judge of adult ability 
than it would be to expect a measure of height at that age to be an 
infallible judge of how tall the adult will be. As is true of all tests, 
more faith should be placed in the extremes. The child who obtains a 
score typical of mental defectives is probably always going to have 
trouble. The child whose score is typical of only the top 1 per cent of 
the population is probably going to do quite well in school. In be- 
tween it is rather hazardous to make long-range forecasts with tests 
of general ability. Rather, the results of such tests should be considered 
primarily as good indications of how well children can perform during 
the ensuing several years. 


Summary 


The purpose of intelligence tests is to measure abstract ability, 
which is distinct from actual accomplishment in school as evidenced 
in achievement tests and teacher-made examinations. The potential 
usefulness of measures of abstract ability is that they could forecast 
how well some children might do if their home and school environ- 
ments were improved. Also, they point to children who are, in a sense, 
performing better in school than they “should,” children who either are 
highly motivated to achieve or who overly impress teachers with their 
accomplishments. These two types of children are often spoken of as 
“underachievers” and “overachievers,” respectively. It is very impor- 
tant to know about both types (regardless of what is done about 
them), and potentially, intelligence tests can help spot these children. 

Unfortunately it is not possible to measure abstract ability entirely 
apart from actual achievement in school. Many of the items on in- 
telligence tests and achievement tests are very similar, and the two 
types of tests correlate highly. However, in those few cases where 




large differences are found between scores on the two types of tests, 
this can provide very important diagnostic information. 
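
One way such differences might be screened for is sketched below. This is only an illustration under stated assumptions, not a procedure given in the text: the function names, the student names and scores, and the cutoff of one standard deviation are all hypothetical, and both sets of scores are simply converted to standard scores within the same group before they are compared.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Convert raw scores to standard (z) scores within the group."""
    m = mean(scores.values())
    sd = pstdev(scores.values())
    return {name: (x - m) / sd for name, x in scores.items()}


def flag_discrepancies(ability, achievement, cutoff=1.0):
    """Label students whose two standard scores differ by at least `cutoff`.

    The cutoff is an arbitrary illustrative value, not a recommended standard.
    """
    z_ability = z_scores(ability)
    z_achievement = z_scores(achievement)
    labels = {}
    for name in ability:
        gap = z_achievement[name] - z_ability[name]  # achievement minus ability
        if gap <= -cutoff:
            labels[name] = "underachiever"
        elif gap >= cutoff:
            labels[name] = "overachiever"
    return labels


# Hypothetical scores, invented for illustration only.
ability = {"Ann": 130, "Bob": 100, "Cal": 85, "Dee": 105, "Eve": 95}
achievement = {"Ann": 60, "Bob": 62, "Cal": 75, "Dee": 66, "Eve": 58}
print(flag_discrepancies(ability, achievement))
```

Most students in the invented group receive no label at all, which matches the point that large differences between the two types of tests are found in only a few cases.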

The content of most tests of intelligence largely was determined by 
the intuitions of the test constructors, plus statistical analyses of test 
data. In spite of their intuitive beginnings, all the tests tend to share 
some common properties. They all capitalize on several factors of in- 
tellect, particularly verbal comprehension. The dominant factors in 
the tests tend to be the ones that are most involved in successful per- 
formance in elementary and secondary school. Because of the simi- 
larity of their content, all the tests tend to correlate rather highly with 
one another, and choices among them for particular uses often must 
be made on practical grounds. 

In schools, intelligence tests have many uses, if they are employed 
wisely. Teachers must realize that intelligence tests are not intended 
to measure what students have actually accomplished, and they are 
not perfect predictors of future accomplishment. The value of intelli- 
gence tests is in providing clues about underachievers and over- 
achievers. It is particularly important for teachers to realize that the 
IQ is not the only important aspect of intellectual potential and per- 
sonal worth. The IQ does not (a) indicate special abilities, (b) pro- 
vide a sure index of creativity, or (c) strongly relate to personality and 
character. 


chapter 13 


Special Abilities 


There are many human attributes to measure other than the kinds of 
abilities which were discussed in the previous two chapters. There we 
talked about the components of intelligence or intellectual ability, 
which are usually thought of as the “higher processes,” the most human 
of abilities. However, individual differences are as easily found, and 
sometimes as important to study, in sensory, motor, mechanical, and 
artistic abilities. 

In order to fully understand an individual's potentialities and liabili- 
ties, much must be learned about him in addition to his intellectual 
capabilities. Two persons could make the same score on a general 
intelligence test and yet be very different in other important ways. 
Similarly, for all the factors which were described in Chapter 11, two 
persons could have exactly the same profile of scores and differ im- 
portantly in terms of other abilities. One student might be underweight 
and frail, the other an excellent physical specimen. This would make 
quite a difference in athletic activities, and because young boys are 
concerned with physical prowess, the difference would be important 
in relation to the social adjustment of the two boys. One student 
might have a flair for mechanical work, and another might have little 
interest or ability. This difference would be important to a high school 
counselor in helping students choose future vocations. Two students 
with about the same over-all scholastic aptitude might differ in that 
one would have intense interests in graphic art and show a creative 
touch in art work as compared with little artistic aptitude on the part 
of the other. The difference might be important in the school adjust- 
ment of the two students and in their plans after high school. Even if 
a student’s basic scholastic aptitude is above average for his class, he 
might have a strong disadvantage because of a hearing deficit. The 
deficit would require special methods of instruction for the student 
and would make it difficult for him to study and play as a normal 
child. These are only a few of the ways in which the “special” abilities 
interact with the “intellectual” abilities to influence school progress, 
social adjustment, and future careers. 


Vision 


The popular practice of talking about “good” and “poor” eyesight 
does considerable injustice to the complexity of visual functions. There 
are a number of separable and only partially related kinds of “good” 
vision. A primary distinction must be made between near acuity and 
far acuity. Near acuity concerns how well the individual can discern 
visual forms within 1 or 2 feet of his eyes. Far acuity concerns how 
well the individual can discern visual forms placed 20 or more feet 
away. A third component of “good” vision is depth perception, the 
ability to judge the proximity of objects to one another. Another com- 
ponent is the ability to distinguish colors. Although it is commonplace 
to think of color blindness as a unitary characteristic, there are differ- 
ent kinds of color blindness. Also, the ability to distinguish colors is 
partly a matter of degree rather than an all-or-none attribute. 

Good near acuity and far acuity (obtained with glasses if need be) 
are absolutely essential to adequate classroom performance. The child 
with poor far acuity will have difficulty in many ways, including (a) 
discerning material on the chalk board or on other visual displays, (b) 
following the hand signals of teachers in art and music exercises, and 
(c) watching the ball in play activities. The child with poor near 
acuity will be crippled in dealing with any kind of written material. 
He cannot read because he literally cannot see the printed words. 
Also, he will have difficulty in learning to write well, because, being 
unable to see what he himself tries to write, he will have difficulty in 
correcting his spelling errors and poor penmanship. 

Wall Charts. The most familiar measure of visual ability is the 
ordinary wall chart, which nearly everyone has encountered in apply- 
ing for a driver's license or taking a physical examination. The Snellen 
chart is used extensively for that purpose. It consists of rows of letters, 
each row containing smaller letters than the one above it. The chart is 
placed 20 feet from the student. If he can read the row of letters that 
the average student can, he is said to have 20/20 vision. If he needs to 
stand 20 feet from the chart to read the row that the average student 
can read from 40 feet, he is said to have 20/40 vision. 
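
The arithmetic behind the notation can be sketched in a few lines. The function name below is hypothetical, and the decimal ratio it returns is only a convenience for comparing results; it is an assumption of this sketch and not part of the chart procedure described above.

```python
def snellen(test_distance_ft, average_distance_ft):
    """Return the Snellen notation and the plain ratio of the two distances.

    test_distance_ft    -- distance at which the student stands (20 feet here)
    average_distance_ft -- distance at which the average student reads the same row
    """
    notation = f"{test_distance_ft}/{average_distance_ft}"
    ratio = test_distance_ft / average_distance_ft
    return notation, ratio


print(snellen(20, 20))  # the student reads the row the average student reads at 20 feet
print(snellen(20, 40))  # the student needs 20 feet for a row the average student reads at 40
```

A ratio below 1 means the student must stand closer than the average student to read the same row.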

Although the Snellen chart is an adequate device for detecting gross 
deficiencies in visual acuity, it has a number of disadvantages. Like 
all the wall charts, it tests only far acuity. A school child could have 
excellent far acuity and still have crippling visual defects of other 
kinds. Some alphabetical letters are easier to distinguish than others, 
and this is not taken into account when using the Snellen chart. Also, 
the rows of letters are easy to remember, and the test can often be 
“faked” by the person who has some prior knowledge of the chart. The 
amount of light on wall charts should be carefully controlled, but in 
much practical work this is given little consideration. When proper 
conditions are obtained, the reliabilities of the Snellen and other wall 
charts are satisfactorily high. One study (73) reports a reliability 
coefficient of .88 for the Snellen chart. 

Color Vision. One of the oldest tests for color vision is the Holm- 
gren Woolens. The subject is given different colors of yarn and asked 
to sort the ones that are alike. It is a crude test which serves only to 
distinguish persons who are very deficient in color vision. A more sys- 
tematic measure can be obtained with the Ishihara color plates 
(Stoelting Co.). The plates are composed of small patches of color. 
The person who has good color vision can see a number on the plate 
formed by patches of a particular color. The color-deficient person 
either does not see the number or sees a different number. More recent 
color-vision tests are the Farnsworth Dichotomous Test for Color 
Blindness (25), the Farnsworth-Munsell 100 Hue Test (26), and the 
Illuminant-Stable Color Vision Test (28). In order to keep color-vision 
tests adequately standardized, they must be used in the same illumina- 
tion and protected from fading or soilage. 

Multiple-component Tests of Vision. In recent years devices have 
been constructed which test a number of different aspects of vision. 
The three best-known instruments are the Ortho-Rater (Bausch & 
Lomb), the Sight-Screener (American Optical Company), and the 
Telebinocular (Keystone View Company). Each of these instruments 
tests near vision, far vision, depth perception, color discrimination, 
and control of the eye muscles. In general, the multiple-component 
instruments represent a considerable advance over the older, single- 
component tests. 


Audition 


The sense of hearing is also composed of a number of different func- 
tions. Only auditory acuity will be treated here, the ability to detect 
faint sounds. Auditory acuity is itself complex: the person who can 
hear well at one tone level may be near deaf at higher or lower fre- 
quencies. Because of their relevance to musical aptitude, some of the 
other auditory functions will be considered in a later section. 

The older tests of auditory acuity employed sound sources such as 
whispered speech or the ticking of a clock. In the whispered speech 
test, the examiner stands some distance from the subject and whispers 
a number of words. The subject tries to say what each word is in turn. 




The examiner walks farther and farther from the subject to determine 
the distance at which the whispered words can be heard. Although 
tests of this kind are adequate for detecting gross losses of hearing, 
they have a number of defects. It is difficult to standardize both the 
loudness and clarity of whispered speech. One examiner will inevitably 
whisper a bit louder and/or clearer than another in spite of the best 
efforts at standardization. Such tests measure auditory acuity within a 
narrow range of the tone, or frequency, continuum. The person who 
can hear whispered speech might not be able to hear sound at a 
different frequency, such as the sound of a ticking clock. The differ- 
ence in the acoustical properties of testing rooms and the problem of 
ruling out extraneous noises add to the difficulty of standardizing this 
type of test. 

A number of instruments have been developed for measuring audi- 
tory acuity at different points on the frequency continuum. These are 
called pure-tone audiometers (see 21, 81). Earphones are used to test 
one ear at a time. The standard procedure is to gradually raise the 
sound intensity until the subject indicates that he can hear the tone. 
Then, starting with a sound that the subject can hear well, the inten- 
sity is lowered to a point where it can no longer be heard. The pro- 
cedure is repeated at different frequency levels. The resulting data 
can be plotted as a profile of auditory acuity (see Figure 13-1). 
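
The raise-then-lower procedure just described can be sketched as a small simulation. Everything here is an assumption made for illustration: the function names, the 5-decibel steps, the invented listener who hears any tone at or above a fixed level, and the choice to average the two readings into a single value for each frequency.

```python
def ascending_reading(can_hear, levels):
    """Raise the intensity step by step; return the first level that is heard."""
    for level in levels:
        if can_hear(level):
            return level
    return None


def descending_reading(can_hear, levels):
    """Start from a loud tone and lower it; return the first level not heard."""
    for level in reversed(levels):
        if not can_hear(level):
            return level
    return None


def audiogram(true_thresholds, levels):
    """Build a profile of hearing loss (decibels) by frequency (cycles)."""
    profile = {}
    for frequency, threshold in true_thresholds.items():
        def can_hear(level, threshold=threshold):
            # Simulated listener: hears any tone at or above his threshold.
            return level >= threshold

        up = ascending_reading(can_hear, levels)
        down = descending_reading(can_hear, levels)
        if up is None or down is None:
            profile[frequency] = None  # outside the range of levels tested
        else:
            profile[frequency] = (up + down) / 2  # averaging is an assumption
    return profile


levels = list(range(0, 105, 5))  # 0 to 100 decibels in 5-decibel steps
# Hypothetical thresholds for one ear, invented for illustration.
one_ear = {512: 10, 1024: 15, 2048: 40, 4096: 70}
print(audiogram(one_ear, levels))
```

Plotting the resulting values against frequency gives the kind of profile shown in an audiogram, with larger values at the frequencies where hearing is poorer.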


[Figure: audiogram plotting hearing loss in decibels against frequency in 
cycles, with curves for the left ear, the right ear, and normal hearing.] 
Figure 13-1. Audiogram of a child with severe high-tone deafness. (Adapted 
from L. A. Watson and T. Tolan, 81.) 


Pure-tone audiometers are now available for group testing. Ear- 
phones are given to all the subjects. Standard answer sheets are used 
for the subjects to indicate whether or not they hear the tone at differ- 
ent intensities. It is not possible to determine the individual's auditory 
acuity as finely in the group testing situation. The individual who 
shows a marked loss in any frequency range on the group test should, 




if possible, be given the individual test to determine more accurately 
the nature of his hearing deficiency. 

A short case history will help show the nature and importance of 
hearing deficits. The teacher has difficulty in understanding the rather 
unusual behavior of Billy Aiken. Some of his symptoms are that he (a) 
often fails to reply when asked questions; (b) makes no effort to 
find out what is going on in the classroom; (c) mutters to himself and 
seldom talks to other children; (d) when other children talk to him, 
he usually smiles but looks embarrassed and does not reply; and (e) 
although seemingly bright in some ways, his over-all performance in 
class is well below average. The school guidance counselor suspects 
that Billy might have a hearing loss, and he requests the parents to 
have a thorough audiometric examination made. The tests show that Billy 
has a severe hearing loss, particularly in the frequency range of 
speech, and that the loss cannot be corrected by medicine or surgery. 
Billy must wear a hearing aid permanently in order to have adequate 
auditory acuity. To help Billy in the classroom, the teacher provides 
remedial instruction in areas where Billy is behind the class, and en- 
gineers classroom activities in such a way as to help Billy develop 
confidence and learn to enjoy social participation. The guidance 
counselor talks with Billy’s parents about ways to help in his remedial 
instruction and ways to encourage his social participation. 


Practical Uses for Sensory Tests 


The individual with a visual or hearing defect is at a definite dis- 
advantage in many performance situations. This is particularly true of 
the young school child who is handicapped. It is sometimes found that 
poor vision or hearing is responsible for apparent dullness. The child 
who cannot see the chalk board or cannot hear the teacher will learn 
very little regardless of his latent capacity for learning. Reading diffi- 
culties can often be traced to visual defects. In order to speak cor- 
rectly, the child must imitate the speech of others. He cannot imitate 
what he cannot hear. The child with poor auditory acuity has difficulty 
learning to read by the use of phonics. To do well on most tests, the 
child must be able to hear the instructions and to read the test items. 
As is common practice in many school programs, children should be 
systematically tested at different age levels for auditory and visual 
acuity. 

Even the sophisticated adult is often unaware that he has a visual or 
auditory defect. It often happens that individuals who wear glasses 
for the first time are surprised that the world “looks that way.” Persons 
with poor far acuity often take it for granted that everyone sees objects 
more than 50 feet away as “big blurs.” Consequently, the individual 




cannot be relied upon to detect his own sensory difficulty and seek 
help. He is likely to blame his inability to perform well on dumbness 
rather than on a visual or hearing loss. 

Sensory ability is necessary for many different occupations. The 
classic example is the baseball umpire’s dependence on “good eye- 
sight.” Color discrimination is paramount to the interior decorator, the 
tailor, and the artist. A moderate level of auditory acuity is necessary 
for most jobs, particularly if it is required to talk with others or to 
follow spoken instructions. However, there are few jobs in which high- 
level sensory ability is the primary attribute. Even jobs that would 
seem to depend heavily on sensory acuity usually require only average 
ability. A typical example is that of a sonar operator, who uses a sound 
echoing device to detect hostile submarines. It was found that up to a 
certain point auditory acuity is a “must” for the job; but above the 
level of average hearing, auditory acuity is not a predictor of good and 
poor sonar operators. 

Sensory disabilities are often prominently involved in adjustment 
problems. It is not uncommon for the student who is partially cut off 
from his environment because of a sensory disability to become with- 
drawn, depressed, and resentful. Sensory tests can be used to detect 
difficulties in audition and vision. Correction or compensation for the 
sensory disability often leads to an improvement in the student's 
personal relations. 


Mechanical Aptitude 


Mechanical ability is popularly thought of as concerning the making 
and fixing of things as distinct from clerical, sales, administrative, and 
professional work. We generally speak of mechanical ability in relation 
to trades and various levels of skilled work. There is no fine dividing 
line between mechanical occupations and those that are not mechani- 
cal. Some occupations that most of us would classify as mechanical are 
plumber, carpenter, automobile mechanic, and television repairman. 

Most teachers have little direct contact with mechanical aptitude 
tests. They mainly are of importance in the vocational guidance of 
high school students who are considering one of the skilled trades as 
a life pursuit. It is seldom the case that the school itself will have and 
use many of the tests required. Rather, such tests usually are adminis- 
tered by industrial or governmental agencies for selecting promising 
applicants for specialized programs of training. School counselors, and 
teachers in general, often are called on to help advise students about 
such vocational choices, and it is wise for them to learn something 
about the nature and use of mechanical aptitude tests. 

There is no one type of test function which underlies mechanical 




work to the same extent that the general intelligence tests relate to 
schoolwork. In order satisfactorily to predict a particular mechanical 
job, a range of different kinds of tests must be used in a battery. Dif- 
ferent combinations of tests are usually needed for different jobs. Some 
of the kinds of tests that have proved useful in the prediction of 
mechanical work and in vocational guidance are described in the 
following sections. 

Intellectual Ability. Because an individual is involved in making 
and fixing things, it does not mean that intelligence is an unimportant 
attribute. When it is possible to do so, either a battery of factor 
tests or at least a general intelligence test should be used 
as a predictor. The spatial and perceptual factors are very useful in 
predicting many mechanical jobs. Some tests embodying these func- 
tions will be considered in a later discussion. However, it is wise 
also to consider the verbal, numerical, and reasoning factors as well in 
the prediction of mechanical work. Because the “verbal” intelligence 
tests are mainly composed of these factors, a good intelligence test is 
often one of the best predictors of job success (see Table 13-1). 


Table 13-1: Correlations of Ability Test Scores 
with Measures of Job Performance 
(Adapted from E. E. Ghiselli, 30) 


                                          Type of job 
                                   Protective  Skilled   Semi- 
Test                     Clerical  service     trade     skilled   [illegible] 
General intelligence       .36       .28        .45       .20       .16 
Arithmetic                 .42       .12        5         i         .15 
Number comparison          .28       .25        Ang       .15       .15 
Spatial relations          .06       os         .45       .30       .27 
Mechanical principles      Sie 8 45 25 
Finger dexterity           .22       .20        .21       .30       .05 


Intelligence tests tend to be more predictive of how well the individual
does in job training than how well he performs subsequently
on the job. This is probably because the training phase requires more
abstract ability. In many cases the training program involves classroomlike
procedures, reading of materials, [...] operations. These are the
kinds of things that intelligence tests predict best. Predictor tests usually
correlate more highly with training criteria than with job performance
because, as a rule, performance is more reliably measured in the former
than in the latter. Progress in training is usually graded more carefully.
There is more of an opportunity to observe the worker, and, in many
cases, tests are used to assess progress in training.

Intelligence tests tend to be more predictive of success in high-skill 
rather than low-skill jobs. That is, validities are usually higher for 
such jobs as electrical technicians and complex machine operators 
than they are for jobs like truck drivers and furniture movers. The 
difference in validity is probably due to the increased importance of 
abstract ability in more highly skilled work. In selecting people for 
unskilled work, the problem is to set up minimum standards of intelligence
rather than to seek persons of high intelligence (see Table 13-2).


Table 13-2: Minimum Mental Ages for Several Jobs 
(From A. S. Beckman, 5) 


Mental
age,
years   Boys                                    Girls
  5     Dishwasher                              Sewer (simple patterns)
        Vegetable parer
  6     Mixer of cement                         Mangle operator
        Freight handler                         Crocheter (open mesh)
  7     Painter (rough work)                    Cross stitcher
        Shoe repairer (simple tasks)            Hand-iron operator
  8     Haircutting and shaving                 Scarf-loom operator
        Gardener                                Dressmaker (not including
                                                pattern work)
  9     Foot-power printing-press operator      Fancy-basket maker
        Mattress and pillow maker               Cook (simpler dishes)
 10     Sign painter                            Sweater-machine operator
        Painter (shellacking and varnishing)    Launderer
 11     Storekeeper                             Librarian’s assistant
        Greenhouse attendant                    Power sealer in cannery


Motor Dexterity. It has long been recognized that the person who
works well with his head does not necessarily work well with his hands.
Accomplishments like shaping a fine piece of pottery, hitting a home
run, and operating complex machinery have little to do with intelligence
or with formal school training.

Among the oldest motor tests are the pegboards, designed to
measure arm, hand, and finger dexterity. A typical example is the




Figure 13-2. Stromberg Dexterity Test. (Courtesy of The Psychological
Corporation.)


Stromberg Dexterity Test (70; see Figure 13-2). The first part of the
test requires the subject to place sixty cylindrical blocks into holes as 
fast as he can. In the second part, the blocks are removed, turned over, 
and put back in the holes. Another widely used test is the Crawford 
Small Parts Dexterity Test (20; see Figure 13-3). In the first part of 
the test, the subject uses tweezers to place pins in holes and then places 


Figure 13-3. Crawford Small Parts Dexterity Test. (Courtesy of The
Psychological Corporation.)




a small collar over each pin. In the second part, small screws are put 
in place with a screwdriver. 

Some tests are designed specifically to test how well the individual 
can work with tools and small mechanical parts. A typical test of this 
kind is the Bennett Hand-tool Dexterity Test (7; see Figure 13-4). 
The test requires the subject to remove and replace nuts and bolts as 
quickly as possible. 




Figure 13-4. Bennett Hand-tool Dexterity Test. (Courtesy of The Psychological
Corporation.)


More complex tests involving hand, arm, and leg coordination have 
been designed for particular jobs. One of the best known of these is 
the Complex Coordination Test (56) used for the selection of pilots by 
the Air Force (see Figure 13-5). The test is a partial replica of an 
airplane cockpit, complete with stick and rudder. Lights on a control
panel simulate the maneuvers of an airplane. The subject must use
stick and rudder to match the stimulus light, the counterpart in the
test of coordinating stick and rudder as required by the situation in an
airplane.

Most tests of motor dexterity are highly dependent on speed. Conse- 
quently, they prove to be better predictors of jobs in which speed 
rather than quality is important. There are many jobs in which speed 




Figure 13-5. Complex Coordination Test. (Courtesy of the U.S. Air Force.) 


is only a minor consideration. The person who can saw a board quickly
does not necessarily have the craftsmanship of a skilled cabinetmaker.

Motor dexterity tests have acceptably high reliabilities, usually in the
.80s and sometimes over .90. An important characteristic of motor tests
is that they tend to correlate very little with one another. Slightly different
manipulations of the same material often have little in common.
For example, the two parts of the Crawford Small Parts Dexterity Test
correlate on the average less than .50. A correlation of only [value illegible]
was found between the two parts of the Stromberg Dexterity Test.
Correlations between different motor tests prove to be even smaller.
Factor-analytic studies of motor dexterity tests have generally found few
broad common factors. Tests in this area are characterized by
specificity.

Because of the small overlap between motor dexterity tests, there 
are no general measures of motor ability such as the general intelligence 
tests supply for intellectual functions. Motor tests are then of relatively 
little use in vocational counseling. They are most legitimately used in 
industrial selection where the job is simple, requires a definite set of 
motor skills, and is highly dependent on speed. Among jobs of this 
kind are those in production line work, sewing machine operation, and 
packaging. 

Motor dexterity tests show at best only moderate predictive validity 
for most situations in which they are used. However, if they are used 
in conjunction with other ability tests, they often add a small, but 
important, increment to the over-all validity of the battery. Motor tests 
tend to be more valid when they are made to resemble the actual 
machine or instrument which is featured on the job. Tests designed in 
this way are called job miniatures. If the job is that of lathe operator 
in a machine shop, the best motor test would employ a miniature lathe 
with the same kinds of dials, handles, and controls that appear on the 
real lathe. During the Second World War, the Army Air Force used a 
variety of motor tests for the selection of pilot trainees. The Complex 
Coordination Test, which resembles most closely what the pilot actually 
does, generally proved to be one of the most valid instruments. 
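
The point that a motor test can add a small but important increment to a battery can be illustrated with the standard formula for the multiple correlation of a criterion with two predictors. The sketch below is only an illustration: the function name and all three correlations are hypothetical values chosen for the example, not figures from any study cited in this chapter.

```python
import math


def multiple_correlation(r1, r2, r12):
    """Multiple correlation R of a criterion with two predictors.

    r1, r2: validity of each predictor (its correlation with the criterion)
    r12:    correlation between the two predictors
    """
    if abs(r12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r1 ** 2 + r2 ** 2 - 2 * r1 * r2 * r12) / (1 - r12 ** 2)
    return math.sqrt(r_squared)


# Hypothetical values, assumed only for illustration.
ability_validity = 0.40   # general ability test with job criterion
motor_validity = 0.25     # motor dexterity test with job criterion
intercorrelation = 0.10   # ability test with motor test

battery_validity = multiple_correlation(
    ability_validity, motor_validity, intercorrelation
)
increment = battery_validity - ability_validity

print(f"battery validity: {battery_validity:.3f}")
print(f"increment over ability test alone: {increment:.3f}")
```

Because the two predictors are assumed to overlap very little, nearly all of the motor test's modest validity is new information, which is why a low-validity test can still improve the battery.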

Motor skills, and tests to measure them, are not very important to
the average classroom teacher. Primarily this is because motor skills
are almost totally unrelated to scholastic aptitude as evidenced in
school topics such as reading, mathematics, and social studies. Motor
skills are important mainly for certain specialized activities in schools,
such as athletics, art, and some of the vocational courses in high school.

Spatial and Perceptual Tests. A wide variety of mechanical work
requires the spatial and perceptual factors which were described in
Chapter 11. The automobile mechanic needs spatial orientation in his
work. In a typical job situation he is lying under the automobile and
must remove a nut from the engine above him. The nut is slanted at a
45-degree angle, and he must remove it with a wrench that has two
joints. The mechanic must orient himself spatially to the complex of
angles and movements in order to do such work. In draftsmanship, it is
necessary to portray three-dimensional objects on two-dimensional
pieces of paper. In some drawings the objects must be shown in tilted
positions or partially assembled. It takes spatial ability, both spatial
orientation and visualization, to work as a draftsman and at many other
jobs.




One of the best-known spatial tests for mechanical aptitude is the
Minnesota Paper Form Board Test (51). It is a useful predictor of
grades in shop courses, supervisors’ ratings of workmanship, objective 
production records, and many other measures of mechanical perform- 
ance (see Figure 13-6). 


Figure 13-6. Sample items from the Revised Minnesota Paper Form Board. For
each item, the subject must choose the figure which would result if the parts
in the first section were assembled. (Reproduced by permission of The
Psychological Corporation. Copyright 1941, Rensis Likert and William H. Quasha.
All rights reserved.)


Perceptual ability is required in a variety of jobs. The perceptual
speed factor has been used most often as a predictor, but it is likely
that the other perceptual factors will eventually find their place in
vocational guidance and job selection. The individual who sits by a
fast-moving conveyor belt and looks for flaws in manufactured products
uses perceptual ability. Any job in which it is necessary to discern
aspects of a visual scene requires perceptual ability to some extent.

Examples of perceptual tests can be found in some of the multifactor
batteries, which were discussed in Chapter 11. Some tests designed
specifically for industrial selection will be described in the section on
clerical aptitude.

Mechanical Comprehension. Among the most successful tests of
mechanical aptitude are those designed to measure the mastery of
mechanical principles, or the ability to reason with mechanical problems.
In a typical problem, a motorist must remove a boulder which
blocks the road. He finds a long, stout pole to do the job. He must
then decide whether to use the pole as a pry or as a lever. He can
construct a lever by balancing the pole on a rock placed between the
boulder and himself (the lever exerts more force than a pry). After
deciding to use the lever action with the rock as a balancer (a fulcrum),
he must then decide where to place the balancer (rock) and
how to best exert his strength against the pole. As another example, a
hoist is being built to lift tree trunks [...]. A system
of gears and chains is set up to transfer the power from an electric
motor to the hoist. It must be decided how large the different gears
should be to give the desired power to the hoist.

The two problems above are typical of those found in mechanical
comprehension tests. Tests of this kind range over several
ability factors. Because of the scarcity of studies of
mechanical comprehension tests, it is not possible to say just
what a particular test measures. Some of them involve the spatial
factors, which are prominently [...] of mechanical
aptitude. Other functions which appear in the tests are numerical
computation, various aspects of [...], and familiarity
with tools and machinery. Examples of mechanical comprehension
tests are the Bennett Mechanical Comprehension Test (8; see Figure
13-7), the Mechanical Reasoning Test of the DAT (see
Chapter 11), and the SRA Mechanical Aptitudes Test (65).


[Sample items: “Which room has more of an echo?” and “Which would be the
better shears for cutting metal?”]

Figure 13-7. Sample items from the Bennett Mechanical Comprehension Test,
Form AA. (Reproduced by permission. Copyright 1940, The Psychological
Corporation, New York, N.Y. All rights reserved.)


Mechanical Information. One of the most useful measures for the
selection of skilled and semiskilled workers is a test of information, or
knowledge, about tools and machinery. For example, a set of questions
like the following would be useful in the selection of automobile
mechanics:




1. What is a torque converter?

2. What is a ratchet?

3. Where is the “needle valve” in an automobile?

4. What source of power is used to run the generator in most
automobiles?

5. How do you recognize “preignition”?

Information tests can be constructed either to measure general
knowledge of mechanical work or to measure knowledge of one particular
job. It is usually the case that a test constructed specifically for
one job, such as for television repairmen, will be more predictive than
a test of knowledge in general about mechanical work. However, a
test which is constructed specifically for one job is likely to be useful
only for selecting personnel for that job or for closely related jobs.
Also, because of the specific knowledge which the instrument measures,
it is usually of little use in vocational guidance, where it is usually
necessary to measure broad functions rather than highly specialized
information. A more general measure of mechanical knowledge is often
useful in vocational guidance.

Analysis of Mechanical Aptitude Tests. A sufficient personnel-selection
program usually requires a careful study of the particular
industrial setting. The diversity of psychological functions which are
required by different jobs makes it necessary to try out a range of tests
to find the ones that will work well in practice. Also, it is often necessary
to invent and construct tests for particular jobs.

Few of the mechanical aptitude tests have been studied as extensively
as the tests of intellectual ability. Consequently, it is usually necessary
to perform considerable research in the job setting to determine the
utility of particular tests. In few cases have norms for the tests been
obtained on a sufficiently representative sample to use them as
dependable guides. It is generally more meaningful to obtain local
norms for particular personnel selection or school programs.

Mechanical aptitude tests are fairly easy to standardize and make
reliable. Reliabilities in many cases approach those of the better
aptitude and achievement tests. Mechanical aptitude tests have modest
validity for many different jobs, the amount varying considerably
with the job. The seemingly small validities for some jobs should be
regarded from a number of standpoints. Primarily there is the possibility
that formal abilities such as those measured in psychological
tests have little to do with job success. Another possibility is that the
criterion of job success, the assessment, is unreliable and consequently
cannot be predicted. If the assessment is determined only from the
sketchy impressions of foremen and managers, it is seldom very reliable.
Because of the unreliability inherent in most job assessments,
mechanical aptitude tests are often considerably more valid than is
apparent from the correlation coefficients. The third point to consider
is that the modest to low individual validities should not obscure the
fact that a combination of several tests in a battery will often produce
reasonably good predictive efficiency.
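
The argument that an unreliable criterion hides part of a test's validity can be made concrete with the classical correction for attenuation applied to the criterion only. The following is a minimal sketch under that assumption; the function name and the two input values are hypothetical and are not taken from the studies cited in the text.

```python
import math


def correct_for_criterion_unreliability(observed_validity, criterion_reliability):
    """Estimate the test-criterion correlation if the criterion were
    measured without error (classical correction for attenuation,
    applied to the criterion only)."""
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("criterion reliability must be in (0, 1]")
    return observed_validity / math.sqrt(criterion_reliability)


# Hypothetical values, assumed only for illustration.
observed = 0.30            # test vs. foremen's ratings
rating_reliability = 0.49  # reliability of the foremen's ratings

corrected = correct_for_criterion_unreliability(observed, rating_reliability)
print(f"observed validity:  {observed:.2f}")
print(f"corrected validity: {corrected:.2f}")
```

The corrected value is an estimate of what the correlation would be with a perfectly reliable job assessment; it does not change how well the test predicts the ratings actually available.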


Clerical and Stenographic Aptitudes 


Clerical Aptitude. Clerical aptitude, as it will be discussed here, 
concerns the office clerk, the individual who deals with files, ledgers, 
accounts, and correspondence. The term “clerk” is much more general 
than this particular usage, referring variously to grocery clerk, depart- 
ment store clerk, and even court clerk. Perceptual speed tests have a 
special importance in the prediction of clerical performance. Perceptual 
speed is involved in chores like proofreading letters, searching for 
particular accounts in a long list, and alphabetizing names. 

A typical test which emphasizes perceptual speed is the Minnesota 
Clerical Test (2). The test is divided into two separately timed parts, 
Number Comparison and Name Comparison (see Figure 13-8). Retest


[Sample items: pairs of numbers and pairs of names, for example
“New York World” and “New York World,” and “Cargil Grain Co.” and
“Cargill Grain Co.” Instructions: If the two names or numbers of a pair
are exactly alike, make a check mark on the line between them.]

Figure 13-8. Sample items from the Minnesota Clerical Test. (Reproduced by
permission. Copyright 1933, The Psychological Corporation, New York, N.Y. All
rights reserved.)


reliabilities range from .85 to .91. Extensive validation research shows
correlations ranging up to .60 between the Minnesota Clerical Test and
business school and job performance criteria.

Perceptual speed is a necessary component of many clerical jobs, but
other functions should be tested as well. Verbal comprehension is a
desirable attribute, especially a knowledge of spelling and [...].
Clerical work often involves some arithmetical operations, and therefore
a numerical computation test is likely to be a useful selection
instrument. If the clerical job requires the use of accounting machines
or other equipment, some of the motor dexterity tests may be of use.
The DAT (Chapter 11) measures not only perceptual speed but a variety
of verbal and arithmetic abilities; hence, it probably would
serve well in the selection of office clerks. The General Clerical Test
(62) is a short battery designed to cover the major functions required
in clerical work. Scores on nine subtests are combined to form clerical,
numerical, and verbal scores.

Stenographic Ability. The selection of stenographers is best made
on the basis of specific job requirements. Typing and shorthand ability
are the major requirements in most stenographic jobs. Therefore,
achievement tests in these skills offer a sound basis for the selection
of stenographers (for typical tests see 11, 54). Here, as in many other
testing problems, the needs of a selection program are not always the
same as those of the vocational guidance situation. In vocational
guidance, before the individual has had an opportunity to test himself
in specialized training or on the job, some prediction must be made
as to how well he will perform. Although an insufficient amount of
research has been done on the aptitude for stenographic work to allow
us to speak with certainty, the most promising attributes seem to be
the motor skills involved in typing, measures of verbal comprehension
and language usage, and interest in stenographic work.


Artistic Aptitudes 


The nature of art and artistic ability has been a matter of interest to
psychologists for well over a hundred years, but in spite of this
interest, the measurement of artistic ability lags behind the testing of
other ability functions. This is due in part to practical considerations.
There has always been a more urgent need for intellectual and vocational
tests than for tests of artistic ability. Research on selecting
men in the Armed Forces, testing children in school, and selecting men
in industry has won financial support because of the immediate gains
to be expected. Although the study of artistic ability offers some practical
advantages, it never has promised a sufficient commercial market
to merit strenuous test construction efforts.

Another reason that tests of artistic ability lag behind is the intrinsic
complexity of the functions to be measured. In this area, it is
difficult to distinguish aptitude from achievement. The accomplished
musician or painter can be judged by what he currently does. But it is
difficult to find the underlying aptitudes that give one child an advantage
over another in reaching eventual artistic accomplishment.

Good art is largely a matter of time and place. Chinese music sounds
cacophonous to us, and no doubt much of our music seems strange to
them. Some primitive music centers almost entirely on the drums and
other percussion instruments. Complex rhythmic patterns are too
elusive for the “civilized” ear. We would miss the esthetic [...]
that the primitive has for his music as [...]
the symphony orchestra. The [...] of a Japanese [...]
are lost on an occidental audience. We might cite many other examples
to show that art is a matter of values. Different people have different
values, and values change over the ages.

Different abilities are involved in the production of art and in the 
appreciation of art. The music critic may not be a musician at all; the 
art historian may have never painted. Different kinds of art work 
require different abilities. 

The measurement of artistic aptitude evolves into several com- 
ponents. For producing works of art there are probably some under- 
lying abilities that cut across different times and different cultures. In 
graphic art work, the ability to make line drawings, to combine colors, 
and to achieve properties of “balance” are required in most paintings. 
In musical ability there are the basic sensory skills of tonal memory, 
sense of pitch, and recognition of rhythms, which to some extent cut
across different kinds of musical production. 


Another attribute which can be tested is the appreciation of art
forms. Appreciation is dependent on the values in a particular culture
and on the individual’s knowledge and acceptance of those values.
Finally, tests can be made of how well the individual can produce
particular art forms; such achievement is dependent both on his initial
aptitude and on the training that he has had.


Musical Aptitude 

Seashore Measures. One of the oldest and most widely used musi- 
cal tests is the Seashore Measures of Musical Talents (67). The test 
stimuli are reproduced on phonograph records, which can be used for 
the testing of moderate-sized groups of subjects. The battery includes 


the following subtests: 

1. Pitch discrimination. The subject is asked whether the second
of two tones is higher or lower than the first. The items are made
progressively more difficult by decreasing the difference in pitch
between the pairs of tones.

2. Loudness discrimination. The subject judges which of two tones
is louder.

3. Time discrimination. One tone is presented for a longer period
of time than another. The subject judges which of the two tones is
longer.

4. Rhythm judgment. The subject judges whether two rhythmic
patterns are the same or different.

5. Timbre judgment. The subject judges whether or not two tones
are of the same musical quality.

6. Tonal memory. Two series of notes are played. In the second
series one of the notes is altered. The subject judges which of the notes
is different.




Scores on the subtests correlate near zero on the average with intel- 
ligence tests. The subtest scores are partly independent, with median 
intercorrelations ranging from .48 to .25 for different samples. Split-half 
reliability estimates for the subtests range from .62 to .88. Rhythm and
timbre are the least reliable. If, as is often done, the six subtests are 
added to form one general measure, high reliability can be expected 
for the test. Except for large differences in scores, the subtests are not 
sufficiently reliable for considering differential aptitudes within the test. 
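
The expectation that the sum of six modestly reliable subtests will be highly reliable can be checked with the usual formula for the reliability of a sum of standardized components: error variance is the sum of (1 - reliability) over the subtests, and total variance is k + k(k - 1) times the mean intercorrelation. The sketch below assumes equal-variance (standardized) subtests; the function name and the particular subtest reliabilities are hypothetical values chosen within the ranges reported above, not the published figures for each subtest.

```python
def composite_reliability(subtest_reliabilities, mean_intercorrelation):
    """Reliability of an unweighted sum of standardized subtests."""
    k = len(subtest_reliabilities)
    if k < 2:
        raise ValueError("need at least two subtests")
    # Each standardized subtest contributes (1 - r_ii) of error variance.
    error_variance = sum(1.0 - r for r in subtest_reliabilities)
    # Variance of a sum of k standardized scores with mean intercorrelation r.
    total_variance = k + k * (k - 1) * mean_intercorrelation
    return 1.0 - error_variance / total_variance


# Hypothetical reliabilities within the reported .62 to .88 range.
reliabilities = [0.85, 0.80, 0.75, 0.65, 0.62, 0.83]
# Hypothetical mean intercorrelation within the reported .25 to .48 range.
mean_r = 0.35

total_score_reliability = composite_reliability(reliabilities, mean_r)
print(f"reliability of the six-subtest total: {total_score_reliability:.2f}")
```

Under these assumed inputs the total score is considerably more reliable than any single subtest, which is consistent with the caution against interpreting small differences between individual subtest scores.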

Scores on the Seashore test are affected very little by age. Similar 
norms are found for elementary school, high school, and adult popula- 
tions. Although the research results are somewhat contradictory, it 
seems that scores are affected only slightly by musical training. These 
two findings taken together suggest that the Seashore subtests measure 
some basic aptitudinal functions which are possibly inherited. The 
larger question is whether the aptitudinal functions involved in the 
tests are of any importance in predicting musical accomplishment. 

An insufficient amount of research has been done with the Seashore 
test to speak with firmness about its predictive utility. Modest to small 
correlations have been found with grades in music classes and with 
teachers’ ratings of musical ability. The test differentiates moderately 
well between students who complete specialized musical training and 
those who drop out. It is reasonable to think that at the level of 
specialized music training most of the persons with poor ability in the 
Seashore type of measures will have already been eliminated. 

A number of persons have argued that the Seashore measures are
not very similar to the skills that are involved in the actual production
of music. The Seashore subtests measure certain types of sensory
discrimination which might be necessary for musical ability but not
sufficient. Potentially, where the Seashore test would have its most
important value would be in helping parents decide whether their
children would profit from extensive musical training. This would save
considerable money and would keep the neighbors from having to
hear little Susan grind away for years at an instrument she will never
master. At present the Seashore test is difficult to administer below the
age of ten, and the predictive validity of the test at younger ages is
not known.

Wing Test. The Wing Standardized Tests of Musical Intelligence
(85) were designed to stay as close as possible to the skills involved in
musical production and appreciation. Like the Seashore test, the Wing
test uses phonograph recordings. The following seven functions are
tested:

1. Chord analysis. Judging the number of notes in a chord

2. Pitch change. Judging the direction of change of notes in a
repeated chord




3. Memory. Judging which note is changed in a repeated melodic 
phrase 

4. Rhythmic accent. Judging which performance of a musical 
phrase has the better rhythmic pattern 

5. Harmony. Judging which of two harmonies is better for a par- 
ticular melody 

6. Intensity. Judging which of two pieces has the more appropriate 
pattern of dynamics, or emphasis 

7. Phrasing. Judging which of two versions has the more appro-
priate phrasing 

The first three subtests measure complex sensory abilities. The other 
four concern the esthetic value of different compositions. The subtest 
scores are added to form one general measure of musical aptitude. 

The Wing test has received favorable response from teachers of 
music, who feel that the test covers many of the skills that are im- 
portant in musical training. Little is known about how well the test 
can predict available criteria. The author reports correlations of .60 
and above between the test and teachers’ ratings of musical ability in 
three small groups. It is possible that the Wing test will prove to be a 
better differentiator of musical talent at higher levels of ability than 
the Seashore battery. The Wing test might then be useful in the 
guidance and selection of students who want to go on from some initial 
musical instruction to more advanced training. 

There are a number of other tests based on phonographically 
recorded tones and musical phrases. The Drake Musical Memory Test 
(22) emphasizes the memory component, which appears in only one 
subtest of the Seashore and Wing batteries. The Drake memory test is 
different in that the items concern short musical melodies instead of 
groups of tones. The subject must determine whether two melodies are 
the same, and if not, whether the change has been made in the key, the 
timing, or in specified notes. The Kwalwasser-Dykema Music Tests 
(49) comprise a battery of ten subtests. Six of the subtests are similar 
to those on the Seashore. The subtests cover much the same ground as 


the Seashore plus the ability to read musical notation and some
components of musical appreciation.

Analysis of Musical Aptitude Tests. Not enough research has been
done to say how well the current tests work. A particular problem is
the dearth of adequate criteria of musical accomplishment. Course
grades in the history, techniques, and general knowledge of music are
the most reliable indices. But these are not the same as artistry in
musical production. Judgment of the actual mastery of musical instruments
must necessarily be based on the impressions of teachers and
other persons, and impressions of this kind usually have only modest
reliability. Even if there are some difficulties in validating the instruments,
much more research should be done to determine how well
they work.

The tests which were discussed in the previous sections are all, 
strictly speaking, tests of appreciation. That is, the subject is not 
required actually to play an instrument but only to listen and judge 
what he hears. However, some of the complex judgments involved 
seem to underlie the skills that are needed in musical production. It is
likely that other types of tests could be used in conjunction with the 
conventional measures to obtain a better estimate of musical ability. 
Motor skills are involved in playing most musical instruments, the 
piano being an outstanding example. Motor tests might be profitably 
used in the prediction of musical accomplishment. Although intel- 
ligence tests correlate very little with the available musical tests, this 
does not mean that they would be of no use in predicting musical 
accomplishment. It would be expected that intelligence and, more
generally, the factors which underlie differential aptitude tests, would
be useful in the prediction of course grades in musical curricula and
in special music schools. 

It is likely that an individual’s interest in musical work will be as
predictive of later success as tests of the ability type. Two such interest 
tests are the Farnsworth Scales (27) and the Seashore-Hevner Tests 
for Attitude toward Music (68). The small amount of research that 
has been done indicates some promise for tests of musical interest. 


Graphic Art 


McAdory Test. The field of graphic art testing has been dominated
by a particular type of item, in which a masterpiece is compared with
one or more altered versions of the same work. One of the oldest tests
of this kind is the McAdory Art Test (53), which came out in 1929.
The test contains pictures of seventy-two art works covering a wide
variety of contemporary art forms, ranging from pictures of furniture
and automobiles to works of art in museums. Four versions of each
art work are given; these differ in shape, arrangement, shading, and
use of color. The subject is required to rank-order the four versions in
terms of his preferences.

Items for the McAdory test were selected in terms of the judgments
of experts, including teachers, critics, and artists. Items were retained
only if at least 64 per cent of the judges agreed on the ranking of the
four versions of each picture. A primary weakness of the test is its
dependence on contemporary art values. For example, it is likely that
the preferences for furniture, automobile designs, and even paintings
have changed since the test was constructed.

Meier Test. The Meier Art Judgment Test (55) is by far the most
widely used test of art appreciation. It also uses the altered-version
type of item. The test differs from the McAdory in that only one
alternative version is given for each original art work, and the items 
concern relatively timeless art masterpieces. The items are all in black 
and white. The altered version of each masterpiece is meant to destroy 
the esthetic organization. In a typical altered version, one figure is 
moved to the side in such a way as to change the balance of the 
painting (see Figure 13-9). 


Figure 13-9. Illustrative items from Meier Art Judgment Test. (Reproduced by
permission of Norman C. Meier.)


The initial selection of items was made on the basis of expert judg- 
ments. Items on which there was high agreement among twenty-five 
experts were retained. The items were further pared down in terms 
of internal consistency statistics. Only those items showing a high 
correlation with the total score were placed in the final form. 
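
A minimal sketch of this kind of internal consistency screening follows. It assumes items are scored 0 or 1, that the item itself is left in the total score (an uncorrected item-total correlation), and that a cutoff of .30 counts as "high"; the cutoff, the function names, and the data are all hypothetical and are not reported for the Meier test.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either list has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def item_total_correlations(scores):
    """Correlate each item column with the total score of each subject."""
    totals = [sum(row) for row in scores]
    n_items = len(scores[0])
    return [pearson([row[j] for row in scores], totals)
            for j in range(n_items)]

# Hypothetical data: five subjects (rows) by four items (columns).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
correlations = item_total_correlations(scores)
retained = [j for j, r in enumerate(correlations) if r >= 0.30]
print(retained)
```

With these invented scores the first two items correlate highly with the total and would be kept, while the last two are unrelated to the total and would be dropped.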

Split-half reliabilities for the Meier test range from .70 to .84 in 
relatively homogeneous groups of subjects. Scores correlate only 
negligibly with traditional measures of intelligence. Only a small 
amount of research has been done to determine how well the test 
predicts available criteria. It has been shown that the test differentiates 
art students from nonart students and differentiates art students from 
one another in terms of the amount of training that they have had. A 
correlation of .46 was found with the grades of fifty art students. 
Correlations ranging from .40 to .69 were found with ratings of creative 
art talent. 
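
For readers who want to see how a split-half reliability of the kind quoted above might be computed, here is a hedged sketch. It assumes an odd-even split of the items and a Spearman-Brown step-up to full test length; neither detail is stated in this section, and the data are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def spearman_brown(r_half):
    """Step a half-test correlation up to an estimate for the full test."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(scores):
    """Correlate odd-item and even-item half scores, then step up."""
    odd = [sum(row[0::2]) for row in scores]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson(odd, even))

# Hypothetical data: six subjects (rows) by six items scored 0 or 1.
scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
reliability = split_half_reliability(scores)
print(round(reliability, 2))
```

The odd-even split is only one of many possible ways of halving a test, and a different split of the same data would generally give a somewhat different estimate.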

Graves Test. To remove the test as much as possible from tradi- 
tional and contemporary art values, the Graves Design Judgment Test 
(37) consists entirely of abstract designs (see Figure 13-10). Each test 

Figure 13-10. Illustrative items from Graves Design Judgment Test. (Courtesy 
of The Psychological Corporation.) 


item consists of either two or three versions of the same basic design. 
The altered version or versions were constructed to violate accepted 
esthetic principles. The judgments of art teachers and art students were 
used to select the best 90 items from an original list of 150. Split-half 
reliability estimates range from .81 to .93. Although the Graves test 
gives promise of being a useful measure, only a small amount of em- 
pirical work has been done with the instrument. 

Worksample Tests. A number of tests were designed to test how 
well individuals can actually produce graphic art works. 

Typical of these is the Horn Art Aptitude Inventory (44), which 
includes the following subtests (see Figure 13-11): 

1. Scribble exercise. Making outline drawings of twenty simple 
objects 

2. Doodle exercise. Making abstract compositions out of simple 
geometrical forms 

3. Imagery. Working from a given set of lines to a completed 
composition 

Other tests which are largely concerned with the production of art 
works are the Knauber Art Ability Test (47) and the Lewerenz Tests 
in Fundamental Abilities of Visual Art (50). 


Figure 13-11. Sample item from the Imagery Test of the Horn Art Aptitude 
Inventory. The subject is shown only the lines in rectangle (a) from which 
he is to make a drawing. Examples of completed drawings are shown in (b) 
and (c). (Adapted from Horn and Smith, 45, by permission of the American 
Psychological Association.) 


The worksample tests must rely on the judgments of graders. Product 
scales are used, in which a particular drawing is compared with a 
standard set. Sample drawings are available for each score level. The 
grader gives a score in accordance with the apparent nearness in 
quality of the subject's drawings to the product scale examples. In 
spite of the apparent subjectivity of the scoring system, moderately 
high reliabilities are reported for tests of the worksample type. Current 
evidence indicates that the worksample tests predict course grades as 
well as the appreciation tests and do a better job of predicting teachers' 
ratings of creative ability. 


Analysis of Graphic Art Tests. The difficulties of defining and 
measuring musical aptitude are magnified in the measurement of 
graphic art aptitude. Since the graphic arts are more dependent on 
fashion, criteria of accomplishment are weaker; hence the underlying 
aptitudes are more difficult to determine. Unlike the sensory discrimination 
functions in musical aptitude, the current measures of graphic art 
aptitude appear to depend heavily on training. Consequently, they are 
of less value in the early guidance of prospective art students, where 
all tests of art aptitude would seem to have their most promising use. 

Current tests are biased toward certain cultural groups. For example, 
it was found that much lower scores on the McAdory test are made by 
Navajo Indian children than by children in New York City, in spite of 
the fact that the Navajo culture has a highly developed art form of its 
own. The available tests appear in most cases to be clever and well 
designed, but the paucity of research which characterizes the testing 
of artistic abilities leaves many questions about how well the tests 
work in practice. Perhaps future factor-analytic studies of graphic art 
tests will lead to a better knowledge of the underlying functions and 
how they can best be measured. 

Both musical and graphic art aptitude tests have a limited but 
important place in school testing programs. Rather than rely solely 
on such tests, it is better to let children try their hands at music and 
art, so that it can be seen directly how well they actually perform. 
However, such actual classroom performance is not entirely predictive 
of how well the child might perform if his interests were increased or 
if he received more intensive training. Tests of musical and artistic 
aptitude can help fill that gap. 


Summary 


In addition to the intellectual abilities which were discussed in the 
previous two chapters, there are many types of special abilities which 
interact with success in school and later success in vocational activities. 
The sensory abilities, particularly vision and audition, place limitations 
on how well students can perform in school. If children cannot see the 
chalk board or hear what the teacher says, they will perform poorly 
regardless of their intellectual potential. Teachers should be alert to 
possible sensory deficits in children, and tests of vision and audition 
should be periodically administered to all children. 

Motor skills have little to do with scholastic aptitude, and different 
motor skills tend to correlate very little with one another. The motor 
skills involved in typing are different from the motor skills involved in 
repairing watches, and both are different from the motor skills required 
to fly airplanes. Motor skills are important primarily for certain 
vocational activities, particularly for those that heavily depend on 
speed of response, e.g., typists and operators of complex machines. 

Mechanical aptitude concerns the making and repairing of appa- 
ratus, such as is involved in the work of the boat builder, airplane 
mechanic, and television repairman. Underlying mechanical aptitudes 
for a particular type of work are (a) at least a modest level of general 
intelligence, (b) motor skills relating to the particular activities, (c) 
perceptual and spatial abilities, (d) information about tools and 
machinery, and (e) above all, a strong interest in the particular line of 
work. Tests of mechanical aptitude are important only in the later 
years of high school when students are seeking advice about future 
vocational activities. 

Although all will agree that it is important to promote artistic 
abilities in children, it has proved very difficult to develop valid 
predictor tests. This is partly because of the great complexity of 
artistic abilities and partly because artistic accomplishment is so much 
a matter of time and place. Tests for musical aptitude rely mainly on 
sensory abilities which relate to the memory, discrimination, and 
judgment of tones and chords. Such sensory abilities probably set a 
limit on how well students can perform in musical pursuits, but a 
high level of performance in such tests does not guarantee that stu- 
dents can reach a high level of musical virtuosity. 

Tests of graphic art aptitude are even less informative than tests of 
musical aptitude. This is because (at least in these times) accomplish- 
ment in graphic art is so difficult to define. Most tests of graphic art 
aptitude require students to differentiate “good” from “less good” 
paintings and designs; others require students to produce line drawings. 

Although tests of musical and graphic art aptitude have a place in 
elementary and secondary school, they should not be relied on as the 
sole guide as to whether students will profit from intensive training in 
artistic pursuits. Rather, it is best to let all children try their hands at 
artistic pursuits, bring out some appreciation and skill in even the least 
apt student, and let students find out for themselves and show others 
which of them has the propensity for high-level accomplishment. 


Suggested Additional Readings 

Anastasi, Anne. Psychological testing. (2nd ed.) New York: Macmillan, 1961, 
chaps. 14, 15. 

Cronbach, L. J. Essentials of psychological testing. (2nd ed.) New York: Harper 
& Row, 1960, chap. 11. 

Super, D. E. Appraising vocational fitness by means of psychological tests. New 
York: Harper & Row, 1949. 

Thorndike, R. L., and Hagen, Elizabeth. Measurement and evaluation in psychology 
and education. (2nd ed.) New York: Wiley, 1961, chap. 10. 


chapter 


Creativity 


A decade ago one of the shibboleths in education was dealing with 
“the whole child,” which was supplanted later with concerns over “why 
Johnny can’t read.” In these days much concern is being expressed over 
how to make Johnny more creative. 

In the previous three chapters we discussed the types of abilities that 
are essential to success in schoolwork and in later vocational success. 
But so far we have not discussed the abilities which are involved in 
truly creative work, as evidenced in the work of leading scientists, 
scholars, and artists. What types of children grow into creative adults? 
What types of instruments can be used to measure creativity? What 
can the teacher do to promote the creative potentials of all children? 
In spite of the obvious importance of these questions, firm answers are 
not yet available. Considerable theorizing and research have been 
done on creativity, and some of the results are quite promising, but 
much more needs to be done before we can speak with certainty 
about the circumstances that surround and promote human creativity. 
This chapter will discuss some of the factors which are currently 
thought to determine creativity.¹ 

What Is Creativity? In spite of the widespread use of the term, 
people seldom stop to state what they mean by creativity. Implicit in 
most discussions is the notion of creative products; that is, before it 
is meaningful to talk about the creativity of individuals, we must talk 
about the creativity of some of their productions. We first look at 
what the person has done (whatever it may be) and then judge the 
creativity of the accomplishment. Until some products are available to 
be judged, it is rather meaningless to argue about the creative abilities 
of particular individuals. 

¹Much of the discussion in this chapter is based on the four biannual reports, 
1955-1961, of C. W. Taylor, Research Conference on the Identification of Creative 
Scientific Talent, University of Utah Press. 

If creative products are essential to judge creative ability, what 
types of products meet the standards? Sheer “goodness” is usually not 
enough to have a person’s works labeled as creative. For example, most 
of us would not use the word creative to describe (a) a well-performed 
surgical operation, (b) the solution of a difficult mathematical problem, 
or (c) the construction of an excellent piece of furniture. In 
describing these accomplishments, we would more likely use terms 
such as “highly skilled” or “knowledgeable.” 

In essence, the word “creative” concerns the invention of something, 
the production of something that is new, rather than the accumulation 
of skills or the exercise of book-learned knowledge. Creativity concerns 
what people add to the store of knowledge which was on hand before 
they came upon the scene. Of course, the “inventions” of children more 
often than not are rediscoveries of existing pieces of knowledge; but 
if this knowledge is not available to them from books, class discussion, 
and other sources, such rediscoveries constitute genuine creative acts. 
Also, children sometimes are creative in the sense of actually adding 
to the existing store of knowledge. 

Conceivably, creativity could be manifested in anything one did; 
however, here we will be primarily concerned with creativity as mani- 
fested in (a) scientific research, including the social and biological as 
well as the physical sciences; (b) scholarly work, such as in philosophy 
and history; and (c) artistic productions. We are not sure that the 
traits that make for creative ability in one of these three areas would 
lead to success in the others, and it may be that different types of home 
and school environments are necessary to nurture creative people for 
the three areas. Current research evidence indicates that there is 
enough common ground among different types of creativity to talk 
about traits that make for creativity in general and environments that 
help promote creativity in general. 


Traits Relating to Creativity 

What are some of the characteristics that denote the creative student? 
Because, as was said previously, it is necessary to witness 
creative products in order to judge creativeness, it is difficult to judge 
the creative ability of students, particularly that of students in the 
elementary grades. Few obviously creative products come from children. 
We may regard some of their products as clever or unusual, but 
seldom do they make a real imprint on the world. The line between 
the fool and the genius is indeed often quite fine. Consequently, to tell 
whether or not a student is actually creative often results in considerable 
guesswork. In children, creativity is best judged by the tendency 
to be original in many ways and on many occasions rather than by a 
few obviously creative productions of a kind that we associate with 
adult creativity. For example, the child who has creative potential as a 
scientist is not likely to manifest his creativity by actually producing 
an important new law of physics but rather by showing unusual insight 
into the implications of simple physical principles and by having many 
clever (for his age) ideas about how such principles relate to daily 
life. Similarly, the child who has creative potential as a writer is not 
likely to manifest his gift by writing a best-selling novel but rather by 
showing an unusual sensitivity to the meaning and use of words in 
themes and poems. 

In discussing the characteristics that often go along with creativity, 
it will be necessary to talk about a “type” of student. Of course, “types” 
are only handy fictions that facilitate discussion. It is important to 
keep in mind that many creative students will be exceptions to the 
rule. Some will have none of the characteristics which will be dis- 
cussed, and the majority will “go in opposite directions” on at least 
one of them. The following characteristics are currently thought to 
typify creative students as a group. 

General Ability. How well do creative children score on tests of 
general ability, such as those discussed in Chapter 12? Some suggest, or 
apparently assume, that children who score only average, or even 
below average, may possess outstanding creative ability. All the 
evidence indicates that this is definitely a misconception. Although not 
all children who perform well on tests of general ability are creative, 
it is incorrect to leap from this observation to the conclusion that some 
children who do poorly on tests of general ability are quite creative. 
By any standard one chooses to apply, those children who are judged 
to be creative, as a group, score moderately high on tests of general 
ability. Most creative children are in the top 10 or 15 per cent on 
intelligence tests. Almost never does one find a child who is average, 
or below average, on tests of general ability whose products strongly 
indicate creative potential. 

The real question is why only some of the children with high IQs 
are creative. Rather than make a distinction between the intelligent 
and the creative, it is more appropriate to make a distinction between 
the intelligent but not creative and the intelligent and creative. Ap- 
parently, to be creative it is necessary to have a moderately high level 
of ability as represented by conventional tests of general ability, but 
beyond that point such tests are not indicative of creative potential. 

One of the difficulties in untangling the difference between intelligence 
and creativity is that the so-called “intelligence tests” have 
usurped a name that has broad connotations. As was stated in Chapter 
12, tests of general ability relate to the understanding of subject 
matters; they do not necessarily relate to invention and discovery. 

Success in School. Another misconception is that many creative 
children do only average or even quite poorly in schoolwork. The 
research evidence strongly suggests that this is not true. The studies 
show that those children who are judged to be creative, as a group, 
perform rather well in school. They may not make all A’s, but it would 
be quite rare to find a creative child who did not generally do at least 
B work. Although creative children are often bored by the routines of 
schoolwork and are distracted by their own special interests, they 
usually have sufficient energy and general ability to carry them to at 
least moderate success in the classroom. 

Introspectiveness. Creative children typically like to think, and 
they enjoy having some time to be alone with their thoughts. They are 
not always the happy extroverts which, for some reason, we tend to 
cultivate. They may appear absent-minded and distracted; and, of all 
sins, they do not always listen when the teacher talks. Because of their 
introspectiveness, creative children are seldom voted “most liked” by 
their peers, and they may not enjoy the same social life in and out of 
school as do their less creative classmates. 

Adjustment. By ordinary standards (which perhaps we should 
revise) the creative child is often not as well adjusted as his noncrea- 
tive peers. The creative child tends to be different in many ways. His 
searching mind goes far beyond that of his peers, and quite often 
beyond that of his teachers. In a sense, he knows too much—too much 
to take the “childishness” of his peers seriously, and too much to take 
seriously all that the teacher says and does. The creative child is in 
much the same position that the college graduate would be in if he 
were required to sit in a fourth-grade class and take it all seriously. 
He would be rather maladjusted in his surroundings. 

Perhaps creative children as a group would not appear somewhat 
maladjusted if we were better able to recognize creativity and better 
able to nourish it when found. However, at the present time, by 
present standards, children who are thought to be creative tend to be, 
variously, shy, nervous, recalcitrant, and socially awkward. 

In addition to a tendency to be somewhat maladjusted, creative 
children tend to possess some other personality characteristics. One of 
these is that they are not strongly influenced by the values and stand- 
ards of others. They typically consider their own values to be best and 
will stick to them regardless of what others think. They often maintain 
a cynical view of what the teacher and the students value, which 
serves to isolate them from the group. 

Another characteristic of many creative children is that they are 
very flexible in their ideas. They can change their minds quite readily, 
without feeling a strong need to have a “pat answer.” They are always 
exploring new ways of looking at issues and are not very disturbed if 
one point of view proves to be faulty. In contrast, most intelligent but 
not creative children look for, and want, “pat answers” and are dis- 
turbed by finding that a seemingly well-established point of view must 
be abandoned. To the creative child, thinking is like a game of chess, 
in which the game itself is enjoyable, and long periods of contempla- 
tion before each move are savored. To the noncreative child, thinking 
is, at best, a means to an end. He is happiest when the issue is settled 
and done, and glad that he has to think no more. The noncreative 
child wants to learn “how to do it,” “what the facts are,” and “how the 
problems are to be solved.” The creative child has a less passive (and, 
in a sense, less disciplined) intellect. He dwells on what he considers 
to be intrinsically interesting, goes off on fascinating tangents, and 
soars above the mental level of the issue at hand. These are more 
reasons why he is different, and being so different he is likely to be 
labeled as odd, unfriendly, and troublesome. 

Creative children usually have a great deal of faith in their own 
ideas. Because they actually do more thinking, and do it better than 
the people around them, they soon learn to trust their own ideas. In 
the classroom this often appears to be unreasonable stubbornness. 

There is a type of teacher (rare, we hope) who is primarily con- 
cerned with convincing students of his vast store of knowledge. Such 
an inflexible fellow is disturbed by the creative child who thinks that 
his ideas are better than those of the teacher. This is why the creative 
child is seldom made teacher's pet. When asked to rate the likability 
of their students, teachers typically prefer intelligent but not creative 
children to intelligent and creative children. 

Home Environments. Creative children tend to come from unusual 
home environments, which typically are unusual in the sense that, 
by common standards, they often are bad. Studies of highly creative 
adults indicate that, as a group, few of them came from happy, well- 
structured home environments. Many creative children come from 
broken homes or from homes where there is either constant strife 
between parents or the relationship is cold. 

Some of the other ways in which the home lives of creative children 
tend to be unusual are (a) having a mother who spends much time 
away from home in vocational or avocational pursuits, (b) being 
rejected by one or both parents, (c) having a father who is poorly 
adjusted as a man and as a family member, (d) frequently moving 
from city to city and/or from school to school, and (e) living with 
foster parents or with only one parent. 

Of course no one creative child has all the bad features of home 
environment mentioned above, and because we are talking about 
group trends, many creative children do come from stable, happy 
homes. However, there is a tendency for there to be something un- 
savory about the home environment. 

There is apparently some truth in the saying that “genius is born 
of misery.” Hopefully, this is not necessarily so, but too many creative 
people come from unhappy beginnings to deny that the saying has 
some validity. Perhaps many creative children withdraw somewhat to 
escape the unpleasant features of their environments. They find 
pleasure in thought and fantasy that they do not find in their outside 
worlds. Thinking then becomes a habit which they carry with them 
all their lives. Conversely, it may be that many children who are in- 
telligent but not creative are absorbed by the pleasant features of their 
external worlds; therefore, there is no need for, and little enjoyment 
from, retreating into their own thoughts. If these things are so (and, 
admittedly, there is much conjecturing here), some way must be found 
for using methods of training to bring out the creative potential in all 
children rather than depending on “misery” to bring out the best in 
some of them. Some suggestions about how this can be done will be 
made later in this chapter. 


Measuring Creativity 

The measurement of creativity is still in its infancy. Consequently, 
the measures shown in this section are only illustrative of the efforts 
which are currently being made. 

General Ability. It must be remembered that even though not all the 
students who make relatively high scores on intelligence tests are 
creative, almost none of the students who make low scores on tests of 
general ability will prove to be creative. Up to a point, then, tests of 
general ability are among the best predictors of creative ability. The 
question is not whether creative students need relatively high IQs but 
rather what they need in addition to high IQs. 

Personality Characteristics. Personality characteristics are among 
the most important determiners of creative ability. Some of the traits 
which apparently relate to creativity are (a) introversiveness, (b) 
flexibility of opinions, (c) intellectual self-confidence, (d) self-willed 
independence, and (e) immense energy for intellectual tasks. Some of 
the methods which are used to measure these personality characteristics 
are (a) teachers’ observations and ratings, (b) self-reports from 
students, and (c) projective tests. More about all these methods of 
attempting to measure personality characteristics will be said in 
Chapter 6. Suffice it to say that none of these is as effective as we 
would like. Presently we have no truly satisfactory methods for 
measuring any personality characteristics, including those relating to 
creativity. Until 
such time as we do have excellent methods for measuring those per- 
sonality characteristics which relate to creativity, the approximate 
methods mentioned above will have to be used. 

Unusual Uses. Supposedly one aspect of creativity is the ability to 
see new and unusual uses for old objects and methods. This is il- 
lustrated by the pilot who, in a pioneering oceanic flight, thought of 
filling the wings of his plane with ping-pong balls to keep the craft 
afloat in case of engine failure over open water. Most of us noncrea- 
tive folk could look at a ping-pong ball for hours and not think of such 
a clever usage. Of course, one such clever idea does not make a person 
creative. The creative person is forever seeing clever, unusual uses 
for common objects and methods. 

Test items can be composed to measure students’ abilities to see 
unusual uses for objects and methods. For example, students can be 
asked “What are some uses that can be made of empty tin cans?” Both 
the number and quality of answers are important. The noncreative 
student will think of “carry water,” “plant flowers,” “hold marbles,” 
and then be at a loss to provide more answers. The really creative 
child will produce a flood of answers including not only many of the 
ordinary uses mentioned above but such clever ones as “cut out the 
tops and bottoms, weld them together to make a stove chimney,” “put 
them in the ground to make golf cups,” and “cut holes in the bottom 
and use it to spread grass seeds.” Similar items can be composed 
relating to screwdrivers, paper clips, bottle caps, and many others. 

Consequences. One facet of creativity is the ability to see the 
many consequences that would follow from a particular action or event. 
For example, what would some of the effects be if the average tempera- 
ture of the earth were raised by 10 degrees? Some obvious conse- 
quences would be that less heating would be needed in homes, there 
would be less need for winter clothing, and people could swim most 
of the year. Some more remote (and perhaps creative) responses 
would be that the polar ice cap would melt and flood many coastal 
cities, Eskimos would have to drastically change their form of life, and 
many new regions could be opened up to farming. Many other such 
items can be composed to measure the ability to visualize conse- 
quences. 

Original Responses to Specific Events. One of the characteristics 
of many creative children is that they are often able to produce quite 
clever slogans, captions for cartoons, and endings to stories. In these 
instances creative children manifest the inventive side of verbal ability, 
which goes beyond the passive aspect of verbal ability that is tradi- 
tionally measured in tests of vocabulary and reading comprehension. 
Some illustrative items are as follows: 

The student is asked to invent a clever title for the picture of a 
sleepy child standing near a worn-out tire. Noncreative students would 
give titles like “Off to bed,” “Who is going to blow out the candle?” 
and “School in the morning.” The creative child is likely to think of 
something quite clever, such as “Time to retire” (item from Guilford, 
76). 

In another type of item students are asked to supply clever endings 
to a sentence or a short narrative. Following is an example. 


John walked through the snow and up the porch steps. After fumbling 
for the key and not finding it, he pushed against the door and found it 
was unlocked. Inside no one greeted him, and when he called “I'm 
home,” no one answered. With his eyes fixed on the light coming from an 
upstairs bedroom, he slowly climbed the stairs. As he reached the top 
stair, he stopped suddenly and said “Oh, my goodness, ______.” 


Noncreative children would complete the story with endings like “I 
forgot to let the dog in the house,” “I meant to mail that letter,” or 
“There is a ghost.” More creative children would provide endings like 
“I took a bus home and left the family sitting in the car in front of my 
office,” or “I don't live in a two-story house.” Many other such items 
can be constructed to test the creative ability to supply unusually 
clever verbal responses. 

Fluency. Apparently one aspect of creativity is the sheer fluency 
with which words, ideas, and solutions to problems are produced. One 
aspect of fluency was considered in Chapter 11: verbal fluency, which 
concerns the rapid production of words. An illustrative item is to ask 
students to produce as many words beginning with “s” as possible in 
two minutes’ time. Although word fluency is only moderately well correlated 
with measures of general ability, it apparently does go along to 
some extent with measures of creativity. 

Another type of fluency, one that apparently is related to creativity, 
is the ability to rapidly produce words in specific categories or that 
bear specified relationships to one another. An example of the former 
is to ask the student to quickly produce the names of objects that roll 
on wheels or the names of creatures that live in water. An example of 
the latter is to ask students to produce words that mean much the same 
as a given word, e.g., synonyms for “intelligent,” such as smart, bright, 
and clever. 

Another aspect of fluency is ideational fluency. Creative children 
not only have better ideas, but they have many more of them as well. 
Ideational fluency can be measured by counting the number of ideas 
and solutions which students produce. This can be done with respect to 
some of the traits mentioned above: unusual uses, consequences, and 
original responses. Besides scoring for cleverness, simple counts can be 
made of the number of responses produced. 

In addition to scoring other types of creativity measures for the 
fluency of ideas, items can be constructed specifically for that purpose. 
An example is as follows: 


Imagine that you own a company which produces bicycles and that you 
want to make many improvements in your product. What changes would 
you make in bicycles? What would you do to make them better? During 
the next five minutes write down as many improvements as you can. Try 
to give good ideas, and give as many of them as you can. 


The number of different improvements listed would be one index of 
ideational fluency. 
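
The counting procedure just described can be illustrated with a short sketch. It is only one possible way of tallying responses, not a standard scoring rule: the function name, the sample answers, and the decision to treat answers that differ only in capitalization or spacing as the same idea are all assumptions made for illustration. Deciding whether two differently worded answers express the same improvement would still require a human scorer.

```python
def ideational_fluency(responses):
    """Return the number of distinct, non-empty responses.

    Assumption: answers that differ only in letter case or spacing
    are counted as one idea; anything subtler needs a human judge.
    """
    seen = set()
    for response in responses:
        # Lowercase and collapse whitespace so trivial repeats match.
        key = " ".join(response.lower().split())
        if key:
            seen.add(key)
    return len(seen)


# Hypothetical answers to the bicycle item.
answers = [
    "Add a built-in lock",
    "add a built-in  lock",   # repeat of the first idea
    "Lights powered by the pedals",
    "",                       # blank line on the answer sheet
    "A seat that adjusts while riding",
]

print(ideational_fluency(answers))  # 3
```

A count of this kind says nothing about the quality of the ideas; it would be used alongside, not in place of, scoring for cleverness.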

Apparently, the various types of fluency are very important for some 
types of creativity. Creative people typically have floods of ideas, most 
of which are impractical, but a few of which are highly ingenious. 
Sometimes creative people are unable to evaluate the “good” and “bad” 
among their own ideas. This is why some creative people work better 
with a partner or as a member of a team. They are the “idea men” 
who must be supplemented by others who can carefully evaluate, 
experiment with, and test their productions. 

It is a mistake to judge the creative abilities of either adults or chil-
dren by the number of unworkable ideas which they produce; rather,
they should be judged by the number of ideas that do work.
Because of the typical fluency of creative people, along with their good 
ideas, they are bound to produce many “whacky” ones as well. 

Perhaps one of the reasons why more people are not creative is that 
they are not willing to let themselves be fluent, not willing to let them- 
selves go mentally and produce a flood of ideas. Because many of these 
ideas are bound to appear silly to ourselves and others, we often stifle 
our thought processes rather than endure self-ridicule or the ridicule 
of others. To be creative, a flood of ideas must be produced, and the 
bad ones must be accepted as a natural part of the process. More 
about this will be said in a later section when we discuss what can be 
done in the classroom to promote creativity. 

Ingenious Solutions to Problems. Creativity concerns not only hav- 
ing many new and unusual ideas but also thinking of very clever ways 
to solve ordinary problems that occur in daily life. One such ingenious 
solution to a problem was shown long ago in the construction of a 
church in New Orleans. The water level in the ground was so high 
and the ground so soft that any ordinary foundation for the church 
would have soon collapsed. The solution: Hundreds of bales of cotton 
were buried in the mud, and the foundation was laid over these. The 
church to this day literally floats on bales of cotton. 




Although it takes a skillful person, and takes him much time, test 
items can be composed to measure the ability to produce or recognize 
ingenious solutions to problems. An example is as follows: 


A truck is rushing medical supplies to a flooded town. Ten miles from 
the city, the truck driver discovers that his truck is about 1 inch too tall 
to go under a railroad overpass. There are no roads nearby that will al- 
low him to go around the overpass. Every minute is important. What 
should he do? 


The clever student will see that an excellent solution is to let some air 
out of the tires and then drive on. 

Most people would think that creativity, by its nature, could not be
measured with multiple-choice items. Creativity concerns the inven-
tion of something. In contrast, multiple-choice items usually concern
the recognition of correct answers. However, if the alternative re- 
sponses contain only a key word or some letters in key words in the 
solution, multiple-choice items actually can measure inventiveness. An
example is as follows:


A farmer living in a remote region finds that a 2-foot length of pipe 
has burst in the series of pipes that carries water from the pump to the 
house and barn. It is urgent that he get the water flowing again. He uses 
a wrench to remove the burst section of pipe. He looks in his tool shed 
and finds only one piece of pipe of the right length and diameter. On 
inspecting the piece of pipe he finds that both ends are threaded clock- 
wise, which means that the turning motion that would be required to 
screw the pipe in at one end would be the opposite of the direction of 
turning that would be required to screw the pipe in at the other end. If 
he turns the pipe clockwise with his wrench, the pipe will screw in at 
one end but not the other, and vice versa if he turns the pipe counter- 
clockwise. He has no other pipe and no special tools for rethreading pipe, 
nor does he have welding equipment or any material sufficiently strong 
to bind one end of the pipe. A key word involved in a temporary
solution to the problem is

a. frozen
b. halfway
c. bury
d. upside down
e. burn

The alternative answers give the student few clues, and, conse- 
quently, he must think of good solutions and see if any of the key 
terms apply. In the example above, only one term, “halfway,” has
been found to involve a good solution. The solution is to screw the
pipe in tightly at one end, then, pressing together the unattached ends,
unscrew halfway, which, in the process, will leave both ends halfway
screwed into their respective attachments. The arrangement might
leak slightly, but it would offer a clever temporary solution, better 
than any that can be obtained involving any of the other key terms. 

To provide even fewer cues to students, only the first letter of one 
or more key terms can be placed in the alternatives. An example is as 
follows: 


In a factory a hole has developed in a large steel container which is used 
to carry hot water from one vat to another. The container is part of an 
elaborate system of wheels and cables which is used to do the job. It 
may take several weeks to get a new container installed. The foreman 
tries to cover the hole with a steel disk, but because of vibration from 
the machinery, the disk keeps slipping away from the hole. Since the 
water is so hot, glue will not work, and no equipment is available to 
weld or bolt the disk over the hole. The disk can be held over the hole 
with a 


a. t
b. p
c. h
d. s
e. m


The best solution involves the letter “m.” A magnet placed on the bot- 
tom of the container under the hole would hold the steel disk firmly 
above. 

Creative Productions. One of the most straightforward, and in
many ways the best, methods of testing for creative potential is to give
the student an issue or problem and ask him to produce creative re- 
sponses. This is essentially what is involved in parts of the Horn Apti- 
tude Inventory, which was discussed in Chapter 13. The student is 
given several lines on paper, and starting with these, he is asked to 
“create” something. Similar tests of creativity in the arts can be made 
using designs, parts of figures, and splotches of color. Starting from 
these bare beginnings, the student must create something on his own. 

Productions can also be used to test literary creativity. Students can 
be given the first two lines of a poem and asked to go on from that 
point to a finished product. Or students can be given the first paragraph 
of a story and asked to complete it. 

Productions can also be used to measure scientific and scholarly 
creativity. For example, a problem that could be used with high school 
students in science courses is as follows: 


Design and describe a vehicle for transporting people and supplies on
the moon. Consider (a) the type of power supply needed, (b) the fuel
that would be used, (c) the type of “wheels” that would be used, (d)
special gadgets that would be needed for operating on the moon’s surface,
and (e) any other properties that you consider relevant.


A problem that could be used with students in the elementary grades 
is as follows: 


Suppose that we were going to build a new school and you were asked 
to design it. You would like for the school to be very modern and to con-
tain many new ideas. What are some of the things that you would put in
your school?


Relatively noncreative students will give such answers as “pretty 
flowers,” “a better ball diamond,” and “more chairs in the lunch room.” 
More creative students will give answers such as “blinds run by motors 
that keep just the right amount of light coming in,” “sliding walls that 
let you make rooms bigger or smaller,” “little televisions on each desk 
where the teacher can show exercises, and it will tell you if you have 
the right answers,” and “tape recorders so you can talk into them and 
then see if you are pronouncing it right.” 

In using productions to test for creative ability, it is good if students 
are given plenty of time to respond. It would be best if students were 
allowed at least several days to write down their ideas. However, a 
problem that is often encountered in letting students wait so long to 
respond is that they will get their “creative” ideas from parents, older 
siblings, friends, or from books. Even if it often is not feasible to give 
several days’ time, students should be given as much time as possible 
in the classroom. For example, a fifth-grade student should be given 
at least one hour to respond to the question above relating to a new 
school. 

If it were not for one salient difficulty, the production methods 
would be used quite widely in measuring creative ability. The dif- 
ficulty is “How do you score the responses?” One part of the problem 
is that of finding scorers who can recognize creative answers when 
they see them. How can we score the productions of students who are 
more creative than we are? Actually this is not so much of a problem 
with students in the elementary grades. What would be creative re- 
sponses for most of them are not beyond the mental comprehension of 
most teachers. It does get to be a problem with high school and col- 
lege students, where some of the complex ideas creative students have 
are beyond the understanding of many teachers. Fortunately, it is not 
always necessary to be highly creative to recognize creative products. 
Even though we might not be clever enough to make the drawing, 
compose the poem, or design the school, we usually can recognize 
clever productions when we see them.




In most commercially distributed tests relating to productions, prod- 
uct scales are used for scoring. Expert judges score many different 
productions. From these, standard examples are chosen to represent 
different levels of performance. Productions of students are then 
scored by comparing them with the standard samples. This still in- 
volves an element of subjectivity, but carefully constructed product 
scales often have high reliability. 
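
One common way to check the reliability of such judgments is to have two judges score the same productions independently and correlate their scores. The sketch below assumes a five-step product scale and uses invented ratings; the function name and the data are illustrative only and do not come from any published test.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two score lists of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores from one judge do not vary")
    return sxy / sqrt(sxx * syy)


# Invented ratings: each judge matched six productions against the
# standard samples for scale steps 1 (poorest) through 5 (best).
judge_a = [1, 2, 3, 4, 5, 3]
judge_b = [1, 3, 3, 4, 5, 2]

print(round(pearson_r(judge_a, judge_b), 2))  # 0.9
```

A high correlation between judges indicates only that the scale is being applied consistently; it does not by itself show that the scale measures creativity.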

Another problem that is encountered in the scoring of productions is 
the sheer labor involved. Scoring each is like scoring a difficult essay 
question. If it were not for the vast amount of time needed to score 
responses, production items would probably be used quite widely to 
measure creative ability. 

Perceptual Tests. In addition to the types of measures mentioned 
previously, it currently is thought that some types of perceptual meas- 
ures actually relate to creativity. We often talk about creative thinking 
as though there were some connection with perception. For example, 
in discussing creative processes, we talk about the ability to “see 
through” arguments, the ability to “focus” on important issues, and 
the ability not to be “distracted” by irrelevant cues. Not enough re- 
search has been done to know for sure whether creativity actually 
relates to perceptual processes, but some of the evidence is sufficiently 
interesting to encourage more research. 

One type of perceptual problem that is thought to be related to 
creativity is illustrated in Figure 14-1. In each of the two items, the 


Figure 14-1. Two items concerning the ability to detect embedded figures.
(Reproduced by permission of Science Research Associates.)


figure on the left is embedded in one or more of the complex figures on 
the right. In order to mark the correct figure, it is necessary to “see 
through” the maze of distracting lines and competing figures within 
the complex pattern. There is some evidence to suggest a relationship 
between this type of perceptual ability and the ability to see through
irrelevancies in scientific problems. This and other types of perceptual
abilities may relate to some extent to creativity. 


Recognizing and Promoting Creativity in the Classroom 


As yet we have no standardized tests or batteries of tests for measur-
ing creativity. Hopefully, we will have them during the next ten
years, but until then, teachers will have to rely on their impressions 
and on their own instruments. Most of the types of measures discussed 
in the last section, e.g., measures of consequences and unusual uses, 
can be constructed and used by teachers. 

It is especially important that outstandingly creative students be rec- 
ognized at an early age and that creative thinking be encouraged in 
young children. Evidently, by the teens, or perhaps even much 
younger, the creative abilities are crystallized. Either they have it by 
then or they never will. Consequently, if anything is to be done in the 
school and home to encourage creative thinking and work (and this is 
a commonly held value), it should be started with first graders. The 
study of creativity is still too new for us to say with certainty what 
should be done, nor can we list specific techniques of instruction that 
are sure to work. There are, however, some general practices and 
points of view which probably will help to bring out the best creative
potentials of all students. 

Tolerate the Child Who Is Different. Admittedly it is difficult to 
change our attitudes, but unless we do, we may be rejecting many 
creative students. As was stated previously, creative children tend to 
be different, if for no other reason than because they are so insightful. 
We tend to favor conforming students who memorize what we tell 
them and what is in the text. We favor the happy-go-lucky, all-
American boy type, and because some creative students are shy, and
they sometimes enjoy sitting and thinking, we are sure that something
is wrong with them. Of course we all worry about, and want to do
something about, the child who is highly withdrawn and is autistic to
the point of mental illness. Except for these extremes, we should not
consider introversive, contemplative children as necessarily “sick.” In
a sense they may be better adjusted, or at least dedicated to a higher
purpose, than the outgoing but vacuous “life of the party.”

Tolerate New Points of View. It is strange that among teachers,
who are supposedly dedicated to critical thought, there is sometimes a
tendency to react negatively to students who say (even if they say it
politely) “I think there is a better reason” or “In my opinion . . . .”
As teachers we are all too prone to encourage students to depend on
us to do the thinking, to give good grades and encouragement when
students parrot our points of view, and to make it difficult for the
student who has his own ideas. If we really want to promote creative 
thinking, we must learn to tolerate and encourage students to form 
their own opinions. 

Encourage Thinking. We do not do nearly enough to encourage 
students to think on their own. We spend much time convincing stu-
dents that there is a “right” answer or method and much time having
students memorize what other people think. Of course there is a lot 
that students must memorize as a foundation for later knowledge, and 
in many cases the ideas of students are not as good as those in the 
book or those the teacher holds. But in addition to insisting that stu- 
dents learn the fundamentals, we should do everything that we can 
to encourage them to think on their own, to criticize, to invent, and to 
look for better reasons and methods. 

The types of measures mentioned in the preceding section are use- 
ful, not only to index creative abilities, but also to provide guides re- 
garding how to construct exercises for promoting the creative abilities 
of all students. For example, a class problem can be focused on “un-
usual uses.” The teacher could instruct students as follows: “The prob-
lem for this week is to get some good ideas for using coat hangers.
Let's think all week about what clever things we can do with coat
hangers. On Friday we will report our ideas.” Then each day of the
week the teacher could maintain interest by reporting several in- 
genious uses that had been proposed for coat hangers. 

Students can be encouraged to think of the consequences that would 
follow from particular actions and events. One set of instructions would 
be as follows: “The number of people in our country is growing 
rapidly. The population may double in the next fifty years. What are 
some of the problems that this will cause? How will that change the 
way of life of people in our country? What can be done to handle the 
problems that will arise? Think about the problem for several days, and 
then you can write down your ideas.” As is true with all such “think” 
exercises, the teacher will have to prime the thinking processes of stu- 
dents by mentioning some ideas and problems of his own and sustain 
interest by bringing up several new ideas each day.

Exercises can be formulated which encourage students to think of 
ingenious solutions to problems, e.g., to put out forest fires, to help the 
blind move about, to translate foreign languages, and many others. 
Students can be encouraged to exercise their creative abilities to the 
full in their own productions—drawings, poems, short stories, essays, 
experiments, and theories. 

In these and many other ways students can be encouraged to be 
creative and enjoy doing it. The teacher should measure the success of 
each day, not only by “How much did I teach them?” but also by 
“How much did I encourage them to think for themselves?” 




Summary


Beyond the ability to master school topics, some students possess 
creative talents which, if properly nourished, will allow them to make 
important contributions to society. Unfortunately, at the present time 
we are only beginning to develop methods for measuring creative
ability. Studies of creative students and adults suggest a number of 
characteristics relating to creativity. Creative people tend to score
relatively well on intelligence tests, and they tend to make at least 
moderately good grades in school. Creative people tend to be differ- 
ent—they tend to come from unusual home and social environments, 
and they differ from other students in the ways in which they ap- 
proach intellectual problems. Some of the mental characteristics which 
currently are thought to distinguish creative students are (a) strong 
drive for intellectual accomplishment, (b) ability to see unusual 
aspects of problems and unusual solutions, (c) floods of ideas, and 
(d) ability to visualize the consequences of particular courses of 
action. 

Because creative students tend to be different and they are not al-
ways willing to agree with what other students and the teacher think,
they are seldom highly popular. Consequently, it often takes a special
effort on the part of teachers to appreciate the talents and energies of
creative students. In addition to ensuring that all students obtain a
good grounding in core topics, teachers can promote creativity in all
students by encouraging them to give critical reactions, to seek new
answers, and to think for themselves.


part 


Prediction and Trait Measurement: Interests, Attitudes, and Personality


Most of the discussion so far in this book has con-
cerned the measurement of various types of ability:
teacher-made tests, achievement tests, and aptitude
tests. The major emphasis of the book has been pur-
posefully set in that direction. However, there is
another realm of measurement, one that is also very
important, one that has its own special methods and
problems. This other realm consists of the “noncogni-
tive” attributes, which are sometimes lumped together
under the overly general name of “personality.”

A primary distinction should be made between
measures of “maximum performance” and measures
of “typical performance.” Most of the measures
discussed in previous chapters deal with how well 
an individual can perform when he tries his best. 
This is the case, for example, with measures of 
spelling ability and arithmetic reasoning. In contrast, 
some of the measures which will be discussed in this 
section are not obviously concerned with “maximum 
performance” (or ability), but rather, with how the 
student usually acts. A test of courtesy offers a good 
example of why the distinction is necessary. A good 
test of courtesy should not mainly depend on knowl- 
edge of courtesy or on how courteous a student could 
be if he tried. Almost everyone knows how to be 
courteous, and people are differentiated in this respect 
by what they actually do in daily life. Consequently, 
it would be a poor test of courtesy to ask a student
questions about what constitutes courteous behavior;
rather what is needed is some indication of the
student's typical behavior in situations where courtesy
is at stake.

Many other examples could be given of traits that 
concern typical performance rather than maximum 
performance. For example, a measure of interest in 
music should concern what the student typically does.
All students know how to act interested in music:
attend concerts, read books on music, turn the radio
to “good music,” etc. What is at stake is the student’s
daily behavior with respect to music. Does he actually 
attend concerts, etc.? The same is true of many other 
measures. They concern what people typically do in 
daily life rather than how well they can perform in 
situations requiring maximum performance. 

Three major types of typical performance will be 
considered in this section: measures of interests, atti-
tudes, and personality characteristics. Although these
three kinds of measures will be more extensively
spelled out in the following two chapters, it would be
wise to stop for a moment here and give partial defini-
tions of each.

There are no sharp dividing lines between the three 
types of measures to be discussed in this section. For
some types of measures, it is difficult to decide whether
they would be more appropriately placed in one cate-
gory rather than another, e.g., interests or attitudes.
However, because the three types of measures are
used for somewhat different purposes and they tend to
have different types of problems, it is important to
distinguish them as well as possible.

Interests concern preferences for particular types of 
activities. Thus, if a student says that he enjoys play- 
ing baseball, giving speeches, and repairing mechani- 
cal devices, these would be classified as interests. The 
main standardized measures of interests concern 
preferences for activities related to vocations, and, 
consequently, they are of major importance for the 
vocational counseling of high school students. Other 
measures of interests relate to activities in school, and 
are useful to teachers in structuring school activities. 

Attitudes concern how students feel about things 
external to themselves. Different measures of attitudes
differ largely in terms of the “things” on which they 
focus. Some of the typical studies of attitudes concern
feelings about (a) racial and ethnic groups, (b) world
problems, (c) social behavior, and (d) classroom 
practices. 

The category of “personality” is somewhat more 
difficult to define. Principally, it refers to enduring 
traits of individuals which determine their social 
behavior. Thus the category applies to traits such as 
shyness, aggressiveness, moodiness, and friendliness. 

Before we move on to discuss particular measures 
of “typical performance,” a proper emotional tone 
should be set by admitting that measures of these 
kinds are not nearly as effective as we would like. The 
development of effective measures of ability, of all the 
kinds discussed in previous chapters, has been quite 
encouraging. We already have much proved “hard- 
ware,” and there is every indication that even better 
tests of ability can be constructed in the future. Un-
fortunately, it is not possible to say the same about
measures of typical performance. We do have some 
adequate measures of attitudes and interests, but our 
efforts to measure personality characteristics have, to 
date, been rather disappointing. Comparatively speak- 
ing, it is easy to measure number skills, vocabulary, 
and reasoning ability. In contrast, how do you measure 
aggressiveness or attitudes toward Negroes? Obviously, 
the complete answer is presently not known. Follow-
ing are some of the approaches that traditionally have
been used in the attempt to measure interests, atti-
tudes, and personality characteristics:

1. Self-report. By far the most widely used ap- 
proach has been to ask the individual what his 
interests, attitudes, and personality characteristics are. 
This is often done with many questions, and the ques- 
tions are carefully selected and combined on the basis 
of much research, but even so the basic nature of the 
data is not altered. Self-report is no worse, and no
better, than what the individual actually knows about
himself and is willing to relate.

2. Observation. Another approach to the measure- 
ment of typical performance is through observation. 
For example, rather than rely on the student to faith- 
fully relate his own characteristics, a description can 
be obtained from the teacher. Although observation
is useful in many situations, it often suffers because
of the shortcomings of the observer, chief among 
which are (a) insufficient experience with students, 
(b) experience only in restricted circumstances, (c) 
wide differences in ability of observers, and (d) im- 
proper inferences about the meaning of what is 
observed. 

3. Projection. A third approach is to try to in- 
terpret the free productions of students in composing 
themes, describing situations, and responding to pic- 
tures. For example, if the themes of a student fre- 
quently concern family strife, it is reasonable to con- 
jecture that something is wrong in the student’s own 
home. In this situation it is said that the student is 
“projecting” his own problems into his themes. Many 
special instruments have been used for studying “pro- 
jection,” particularly in relation to the measurement of 
personality characteristics. The major difficulty with 
most presently used projective tests is that they are 
highly dependent on the intuition of the examiner. 
This not only requires the use of a highly trained 
tester, but it rests the validity of particular findings on 
the skills of the tester being used. 

4. Objective measures. Eventually it is hoped that 
interests, attitudes, and (particularly) personality 
attributes can be measured by objective tests which 
are similar in their properties to the objective meas- 
ures of ability currently in use. There is some evidence 
to indicate that some of the supposed measures of 
ability actually relate to personality characteristics. 
For example, as will be discussed more fully later, 
there is some indication that some measures of per- 
ceptual ability actually relate to personality. Numerous 
other suggestive findings have related objective test 
results to personality characteristics. These findings 
give promise that a truly objective study of typical 
behavior may be had in the years ahead. 


chapter 15


Attitudes and 


Interests 


Teachers necessarily are very much concerned with the attitudes and
interests of their students. Although some would deny it, teachers in-
evitably find themselves in the business of “selling” certain points of
view, among which are patriotism, courtesy, sportsmanship, and a
valuing of “good” art, scientific accomplishment, and scholarly produc-
tions. Undoubtedly we are in favor of these and many other things,
and we try to sell these points of view to our students. Also, teachers
are influenced by the attitudes and interests of their students. Much
of our classroom activity is devoted to making topics more interesting.
For this reason, we use films, contests, demonstrations—anything to 
excite the interests of students. If students value scholarly accomplish- 
ment and have positive attitudes toward education, it is relatively easy 
to start them on the road to becoming educated men; but it is very 
difficult if their attitudes in this respect are poor. In these and many 
other ways, teachers are concerned about the feelings of their stu- 
dents; and in order to deal effectively with these matters, it is im- 
portant to know how best to measure attitudes and interests. 


Interest Inventories 

The major commercially distributed interest inventories relate to
occupational interests. They are used primarily in the vocational guid-
ance of high school students. Interest inventories also are used to
measure preferences for activities in daily life. For example, an in-
ventory can be used to study the preferences of students for different
types of reading material or for different types of athletic activities.
Inventories relating to activities in daily life can easily be constructed
by teachers for the particular purposes at hand. First will be discussed
measures of vocational interests, and then some examples will be given
of the measurement of interests in daily activities.



Interests were defined earlier as stated preferences for activities. As 
the word “stated” emphasizes, interest inventories depend on the 
individual’s honest and accurate reporting of what he likes to do. At 
the outset some justification needs to be given for using interest in- 
ventories. The common-sense approach to learning about interests 
would be simply to ask the individual what occupations he prefers. If 
the individual already knows that he wants to be a physician, sea
captain, or fireman, it would be a waste of time to have him record his 
preferences on a printed form. The purpose in administering tests is to 
gain some new information about people. 

There is a considerable amount of evidence to show that stated 
preferences for occupations are unrealistic. This is particularly so 
among adolescents and young adults, with whom interest inventories 
are most needed. Young people usually are quite unaware of the 
specific activities which are entailed in different occupations. The 
individual’s stated preferences for occupations are often prompted by 
glamorized stereotypes. The physician is remembered as the heroic 
figure who performs the miraculous operation while the gallery looks 
down in silent awe. The sea captain is seen holding steadfast to the 
helm against the stormy onslaught of the sea. The fireman is pictured 
descending the ladder with the rescued maiden on his shoulder. All 
these images are of course very unrealistic. Few physicians do surgery 
at all. They must spend many hours in unheroic activities such as 
reading medical texts, writing reports, and calming the fears of anxious 
patients. The sea captain has scant opportunity to steer the ship 
because of the modern electronic gadgetry which automatically navi- 
gates. The captain usually is a sea-going businessman, ambassador to 
passengers and clients, who must be concerned with such matters as 
bookkeeping, personnel management, and correspondence. No one 
considers what the fireman does in the larger portion of his time— 
tending equipment, collecting funds for charities, and helping rescue 
cats from inaccessible perches. 

The purpose of the interest inventory is to ask the individual about 
his preferences for a wide range of relatively specific activities such as 
mending a clock, preparing written reports, and talking to groups of 
people. From these a diagnosis is made of the occupations which most 
closely match the interests of the individual. A fundamental
assumption in the use of interest inventories is that people in
different occupations have at least partially different interests.
Otherwise there would be no way in which interest tests could be used
successfully to advise people to consider one occupation rather than
another.

The Strong Interest Inventory. One of the earliest and still most 
widely used measures of interests is the Vocational Interest Blank 


Attitudes and Interests 


(VIB) developed by E. K. Strong (72). Separate forms are available 
for men and women. The VIB employs 400 questions about relatively 
specific activities. On most of the items the student indicates his 
preferences by marking one of the three categories “like,” “indifferent,” 
“dislike.” Some illustrative items from the men’s form are as follows: 


Buying merchandise for a store      L   I   D
Adjusting a carburetor              L   I   D
Interviewing men for a job          L   I   D


Responses to the VIB can be scored in terms of forty-seven
occupations on the men’s form and twenty-eight occupations on the
women’s form. A separate scoring key is available for each occupation.
The scoring keys were developed from the responses to the VIB made by
successful persons in each of the occupational groups. Each scoring
key is composed in such a way as to differentiate the people in a
particular profession from people in general. This procedure for
developing scoring keys is referred to as criterion keying. Each
scoring key consists of a set of weights to be applied to the item
responses. The weights range from +4 to -4. A positive weight means
that people in the profession, in accounting, say, mark “like” to the
item more frequently than do people in general. A negative weight
means that accountants mark the “like” category less frequently than
people in general. The larger the difference between the profession
and people in general, the larger is the weight. If an item does not
differentiate a profession from people in general, it receives a zero
weight. A considerable amount of research work was required to obtain
the occupational keys, and scoring keys for new occupations are
gradually being developed.

The responses of an individual are scored on either some or all of
the professions. The scores can be converted to standard scores,
percentiles, or to a grading system ranging from A to C. The resulting
profile of scores is used to interpret the individual's interests.
People usually express high interest in a number of related
professions such as mathematician, engineer, and chemist.
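
As a rough illustration of how a criterion key might be applied, the following sketch sums the weights of the items a person marks "like." The key values, the function name `occupation_score`, and the rule that only "like" marks earn an item's weight are assumptions made for illustration; they are not the published VIB keys or scoring rules.

```python
# Hypothetical scoring key for one occupation (weights invented, range -4..+4).
ACCOUNTANT_KEY = {
    "Buying merchandise for a store": 3,
    "Adjusting a carburetor": -2,
    "Interviewing men for a job": 0,  # item does not differentiate the group
}


def occupation_score(responses, key):
    """Add the key weight of every item the person marked 'L' (like).

    Assumption: 'I' and 'D' marks contribute nothing in this simplified sketch.
    """
    return sum(key.get(item, 0) for item, mark in responses.items() if mark == "L")


# One person's marks on the three illustrative items.
responses = {
    "Buying merchandise for a store": "L",
    "Adjusting a carburetor": "D",
    "Interviewing men for a job": "L",
}
print(occupation_score(responses, ACCOUNTANT_KEY))
```

A higher total would suggest that the person's likes resemble those of the keyed occupation more than those of people in general.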

The Kuder Interest Inventory. The Kuder Preference Record (48)
and the Strong VIB are the two most widely used interest inventories.
The two inventories present an interesting contrast in procedures of
test development. Instead of using the “like,” “indifferent,” “dislike”
response categories of the VIB, the Kuder inventory presents items in
triads. The subject picks from three activities the one that he likes
most and the one that he likes least. Two illustrative item triads are
as follows:




Visit an art gallery — 
Browse in a library 
Visit a museum 


Collect autographs — 
Collect coins — 
Collect butterflies — 


Instead of scoring the form in terms of numerous separate occupa- 
tions, scores are given in ten general areas: outdoor, mechanical, com- 
putational, scientific, persuasive, artistic, literary, musical, social serv- 
ice, and clerical. The ten categories were derived by item-analysis 
procedures, much the same as factor analysis would obtain. Unlike the 
VIB, the ten interest areas on the Kuder inventory were not related in 
any direct way with the responses of people in specific occupations. 
The interpretation of an interest profile from the Kuder inventory is 
largely dependent on the judgment of the counselor in regard to the 
interests that are involved in different occupations. 

Although the Strong and the Kuder inventories started out on dif- 
ferent tracks, recent developments have brought them closer together. 
A number of broad interest areas have been developed for the Strong
VIB by factor-analytic studies. Scores on these can be obtained in
addition to those for specific occupations. Research done since the
Kuder inventory was published shows that the original logical method
of analyzing interests leads to good predictions for certain occupations
and poor predictions for others. Efforts now are being made to obtain
equations for predicting how closely an individual's interests on the
Kuder inventory match those of people in different occupations.
However, this work is not far enough along to provide the same
empirical evidence for interpreting responses as is furnished by the
VIB.

Interests and Accomplishment. Because a person is interested in 
certain activities, such as those relating to engineering, it does not 
necessarily mean that he has the capacity for accomplishment in that 
field. The relationship between interests and ability is particularly 
tenuous in children and young adolescents. The child who professes 
an interest in athletic activities, for example, may have little athletic 
ability, and similarly for artistic and scientific pursuits. However, there 
is an increasing congruence between interests and ability as the indi- 
vidual matures. It is very difficult for a person to maintain an interest 
in activities in which he constantly performs poorly. As the child 
matures, his interests gradually shift to the things that he can do at 
least relatively well. 

Stability of Interests. Without some stability over time, scores on 
interest inventories would be of little use in advising people on voca- 
tional choices. Interests are notoriously unstable in children and 




adolescents. They begin to stabilize in the late teens and remain re- 
markably stable throughout adulthood. Strong (71) found retest corre- 
lations in the .70s and .80s over intervals as long as twenty-two years. 
This is both a credit to the VIB and strong evidence that interests are 
relatively enduring characteristics of human adults. 
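
A retest correlation of this kind is typically a product-moment (Pearson) correlation between scores from two administrations of the same inventory. The sketch below shows that calculation under that assumption; the function name `retest_correlation` and the scores are invented for illustration and are not Strong's data.

```python
from math import sqrt


def retest_correlation(first, second):
    """Pearson correlation between two sets of scores for the same people."""
    n = len(first)
    mean_x = sum(first) / n
    mean_y = sum(second) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    ss_x = sum((x - mean_x) ** 2 for x in first)
    ss_y = sum((y - mean_y) ** 2 for y in second)
    return cross / sqrt(ss_x * ss_y)


# Invented scores for five people on one occupational key, tested twice.
first_testing = [42, 35, 50, 28, 46]
second_testing = [40, 38, 47, 30, 49]
print(round(retest_correlation(first_testing, second_testing), 2))
```

A value near 1.00 would indicate that people kept nearly the same relative standing from the first testing to the second.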

Interest Inventories in Vocational Guidance. Interest inventories 
are second only to intelligence tests as aids to vocational guidance. 
Interests are, at least theoretically, very important to consider in 
choosing occupations. If an individual really likes a particular type 
of work, he often can succeed in spite of only a moderate amount of 
aptitude. No matter how much initial aptitude a person has, he can fail 
in a line of work through inattention and lack of effort. 

In vocational guidance, interest inventories are used for two related 
purposes: to predict satisfaction in the work and to predict successful 
performance. The criterion keying on the Strong inventory provides 
some supporting evidence that interest tests can predict future satis- 
faction on the job. Another type of evidence is that follow-up studies 
of individuals who completed the VIB in college show a strong 
tendency for people to enter occupations similar to their expressed 
interests. Both these pieces of evidence also tend to support the
hypothesis that interests are predictive, at least to some extent, of job
performance. Strong has gathered more direct evidence to show that
interest scores are predictive of performance in some occupations. For
example, there is a relationship between the amount of interest shown
on the key for insurance agents and the amount of insurance which
agents sell.


Even though interests are, at least theoretically, very important to 
consider in choosing occupations, it does not necessarily follow that the 
available instruments are maximally effective measures of interests. As 
is true in most areas of testing, a great deal more research is needed. 

It is unfortunate that interest tests cannot be used as successfully 
in the selection of people for particular jobs as they can in vocational 
guidance. It has been shown repeatedly that interest tests can be faked 
to a marked extent. If people are told to mark the Kuder or the Strong 
inventory as a successful engineer or physician would, they will obtain 
profiles similar to the profession in question. 

People usually give honest responses in a vocational-guidance
situation. They are there for information and advice, and there is
little to gain by faking an interest inventory one way or the other.
If, as is usually the case, the vocational-guidance facility is not
connected with personnel-selection programs, there is no way in which
test scores can lower the individual's chances of getting a particular
job. When an individual applies for a job, he is seldom as desirous of
learning about himself as he is of obtaining the position. If he is
being interviewed for




a job as an electrician, he knows that it behooves him to answer “yes” 
to an interest item like, “Do you like to repair electrical motors?” The 
small amount of success that interest inventories meet in personnel- 
selection programs should not mar the important place they have in 
vocational guidance. 

A Case History. How interest inventories are helpful in vocational 
guidance is illustrated by the case of Martin Batson. Since he was in 
elementary school, he had always assumed that some day he would be 
a physician in general practice like his father. Now that he is in the 
twelfth grade and considering what college to attend, he wonders 
whether he will like, and succeed in, premedical training in college. 
The high school counselor suggests that preparatory to discussing 
college training, it probably would prove helpful to obtain information 
from tests. The counselor already has available Martin's school grades 
and the results from achievement tests that periodically had been 
administered over the years. In addition, he has Martin take the Dif- 
ferential Aptitude Tests (discussed in Chapter 11) and the Kuder 
Preference Record, Vocational Form. The results from the interest 
inventory provided both the counselor and Martin with some helpful 
information. Martin’s raw scores on the “Kuder” were converted to a 
nine-point scale, with 9 representing the highest possible interest score, 
1 representing the lowest possible interest score, and 5 representing an 
average interest score. Martin’s scaled scores were then compared with 
the scores of physicians and surgeons in general. The following results 
were obtained: 


Interest area      Martin's scores    Physicians’ scores
Mechanical                7                   6
Computational             7                   4
Scientific                9                   7
Persuasive                4                   3
Artistic                  6                   5
Literary                  6                   5
Musical                   5                   5
Social service            4                   5
Clerical                  3                   3


Two points stood out in comparing the two profiles. Martin had higher 
interests in scientific and computational areas than did physicians in 
general. Seeing this finding, the counselor asked Martin whether or 
not he had ever considered a career in one of the basic sciences, such 
as physics or chemistry, rather than in medical practice. Martin said 
that he had always enjoyed science courses very much and had thought 
that these interests could be satisfied in medical practice. 




A study of Martin’s grades and scores on achievement and ability 
tests provided other clues. His over-all ability and achievement were 
quite high, particularly in mathematical areas. His course grades and 
achievement test scores were particularly high in science topics. In 
discussing these facts, Martin began to wonder if he actually would 
be happy as a physician. The counselor gave Martin some literature 
to read about medical practice and about scientific careers. Also, he 
suggested that Martin contact professors at a local university to find 
out about the differences in requirements in the premedical curricula 
as opposed to the basic science curricula. After digesting this informa- 
tion and discussing the matter with family and friends, Martin came 
back to tell the counselor that he had decided to major in physics in
college.


Interests in Daily Activities 


Like measures of vocational interests, nearly all the measures of
interests relating to daily activities are based on self-report. A number
of methods are available to study self-report, which can be used in
measures of interests as well as in measures of attitudes and
personality. These methods can be used to measure interests in study
topics, recreational activities, hobbies, and many others. With some
practice, teachers will find that they can construct such measures for
the particular purpose at hand. Some of the most prominently used
methods are described as follows:

Absolute Responses. One of the simplest techniques for studying 
self-report is to obtain “absolute responses” to a list of statements or 
activities. This can either take the form of ratings of agreement or 
disagreement with statements, or ratings of liking or disliking of activi- 
ties. An example is as follows: 


                                        Dislike   Like
1. Reading in front of class              ___      ___
2. Reading silently in class              ___      ___
3. Reading my assignments at home         ___      ___


Instead of using only a two-point (like-dislike) scale, a multipoint
scale can be used. A sample scale is

6: Always like
5: Usually like
4: Sometimes like
3: Sometimes dislike
2: Usually dislike
1: Always dislike




Students either are asked to write in the appropriate number, or, better, 
they can mark the appropriate number as follows: 


                           A-d   U-d   S-d   S-l   U-l   A-l
1. Writing themes           1     2     3     4     5     6
2. Writing book reports     1     2     3     4     5     6


Convenience dictates the wide use of absolute responses in self- 
report measures of interests, attitudes, and personality characteristics. 
It is rather easy to make a list of the things to be rated, and the 
absolute responses can be quickly made by students. Data obtained 
from the method also are relatively easy to analyze. To find the pre- 
dominant interests of a class or larger group of students, all that is 
necessary is to obtain the percentage of “like” responses, or, if a 
multipoint scale is used, the average rating of each item. 
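
The two summaries just described can be sketched in a few lines. The function names and the class data below are invented for illustration; the multipoint ratings assume the six-point scale given earlier, with 6 meaning "always like."

```python
def percent_like(marks):
    """Percentage of 'like' marks among a class's like/dislike responses to one item."""
    return 100.0 * sum(1 for mark in marks if mark == "like") / len(marks)


def average_rating(ratings):
    """Mean of a class's multipoint ratings of one item."""
    return sum(ratings) / len(ratings)


# Invented responses of four students to the item "Reading silently in class".
two_point = ["like", "like", "dislike", "like"]
six_point = [6, 5, 4, 5]

print(percent_like(two_point))
print(average_rating(six_point))
```

Repeating the calculation for every item and sorting the results would give the predominant interests of the group.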

In spite of the convenience and wide use of the absolute response 
technique, it has some potential faults. One fault is that the responses 
are affected by test-taking habits. Purely apart from the content in- 
volved, some students will mark “like” most of the time, and others 
will mark “dislike” most of the time. On multipoint scales some students 
will tend to mark toward the extremes of the scale, and others will 
mark mainly near the center of the scale. Such test-taking habits make 
the results somewhat difficult to interpret. 

Unless practical considerations require the use of absolute responses, 
it generally is better to employ one of the types of “relative responses.” 
With relative responses, the student chooses one from a list of things, 
marks which of two activities he likes more, or ranks activities from 
the one liked most to the one liked least. Several methods for obtain- 
ing relative responses are described in the following sections. 

Multiple Choice. One of the simplest methods of obtaining relative 
responses is to have the student choose from a list the activity which 
he likes most. Examples are: 


1. I most like to read about
   a. scientific inventions
   b. animals
   c. explorers
   d. life in foreign countries

2. My favorite class project would be
   a. keeping goldfish
   b. collecting rocks
   c. making a science exhibit
   d. working on a class newspaper




Although the multiple-choice item often is quite useful for studying 
interests and other types of self-report, it also has several potential dis- 
advantages. One is that it gives no information about second, third, 
etc., choices. This is particularly disadvantageous when, as is often the 
case, there are many activities involved. If these are placed in one 
long list and the student marks only the one that he likes most, this 
provides only meager information about the student's interests. When 
the list is long, one compromise is to break it into a number of multiple- 
choice items with only four or five activities in each. The difficulty with 
this method is that the apparent preference for an activity is very 
much related to the other alternatives in the particular item. If, for
example, “keeping goldfish” is included with three very popular
activities, it would appear to be very low in interest value. If it
were included with three unpopular activities, the results would make
it seem that “keeping goldfish” were highly popular. In order to learn
more about students’ interests, and in order to escape the difficulties
encountered when breaking a longer list into a number of
multiple-choice items, it is better to have students rank the members
of the list from “like most” to “like least.”

Ranking. Interests can be measured by having students rank
activities from “most like” to “least like” as in the following example:


Reading topic                  Rank
a. Birds of North America       —
b. Exploring the Arctic         —
c. Life in Portugal             —
d. Wild West Stories            —
e. Planting a Garden            —
f. Murder at Midnight           —
g. Cartoon Comics               —
h. American Poetry              —
i. Lives of Great Men           —
j. How to Win Friends           —


The students are instructed to look through the list, find the topic
that sounds most appealing, and place a 1 in the appropriate space on
the right; then write a 2 by the next most appealing, and so on until
the appropriate number is written by the least-preferred topic.

The dominant interests of a group of students can be determined by
calculating the average rank given to each activity. For example, it
might be found that adventure topics (e.g., Exploring the Arctic) are
ranked high and that intellectual topics (e.g., Lives of Great Men) are
ranked low. Such findings would not only help the teacher understand
the present interests of students for different topics but would also
indicate the directions in which reading habits might be improved.
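
The average-rank calculation can be sketched as follows. The function name `average_ranks` and the three students' rankings are invented for illustration, and only three of the ten topics are used to keep the example short.

```python
def average_ranks(rankings):
    """Mean rank of each topic over all students (rank 1 = most liked)."""
    topics = rankings[0].keys()
    return {
        topic: sum(student[topic] for student in rankings) / len(rankings)
        for topic in topics
    }


# Invented rankings by three students.
rankings = [
    {"Exploring the Arctic": 1, "Wild West Stories": 2, "Lives of Great Men": 3},
    {"Exploring the Arctic": 2, "Wild West Stories": 1, "Lives of Great Men": 3},
    {"Exploring the Arctic": 1, "Wild West Stories": 3, "Lives of Great Men": 2},
]

averages = average_ranks(rankings)
# List topics from most preferred (lowest average rank) to least preferred.
for topic in sorted(averages, key=averages.get):
    print(topic, round(averages[topic], 2))
```

Because 1 stands for the most appealing topic, the lowest average rank marks the group's strongest interest.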




A major caution in the use of ranking is that it is best to have no 
more than ten, or at most fifteen, “things” in the list, the number vary- 
ing with the age of the student. Very long lists require a great deal of 
time to rank, and students often get quite confused in the clerical 
problems involved. 

Pair Comparisons. Even more detailed relative responses than 
those obtained from multiple-choice or ranking techniques can be ob- 
tained from the method of pair comparisons. In this method, students 
are presented with all possible pairs of the things being studied and 
are required to mark the more preferred member of each pair. An 
example is as follows: 


Pair 1: Birds of North America Exploring the Arctic 
Pair 2: Birds of North America Life in Portugal 
Pair 3: Birds of North America Wild West Stories 


Only a part of the total list is shown above. The complete list for the 
ten reading topics shown previously would require a total of forty-five 
pairs. For each pair, the student marks either the space on the right or 
the space on the left to indicate his preference. The results can be 
analyzed by counting the number of marks for each topic in all the 
responses by one student, and then averaging these over all the
students in a class or larger group.
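
The count of forty-five pairs follows from the usual formula for the number of pairs among n things, n(n - 1)/2, which for ten topics is 10 × 9 / 2 = 45. A minimal sketch of generating the pairs and tallying one student's choices is given below; the function names and the simulated student are assumptions made for illustration.

```python
from itertools import combinations

TOPICS = [
    "Birds of North America", "Exploring the Arctic", "Life in Portugal",
    "Wild West Stories", "Planting a Garden", "Murder at Midnight",
    "Cartoon Comics", "American Poetry", "Lives of Great Men",
    "How to Win Friends",
]


def all_pairs(items):
    """Every possible pair of items, each pair listed once."""
    return list(combinations(items, 2))


def preference_counts(choices, items):
    """Number of times each item was marked as the preferred member of a pair."""
    counts = {item: 0 for item in items}
    for chosen in choices:
        counts[chosen] += 1
    return counts


pairs = all_pairs(TOPICS)
print(len(pairs))

# Simulated student who always prefers the first-listed member of each pair.
choices = [left for left, right in pairs]
counts = preference_counts(choices, TOPICS)
print(counts["Birds of North America"], counts["How to Win Friends"])
```

Each topic appears in nine pairs, so one student's count for a topic can range from 0 to 9.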

The major advantage of the pair-comparisons technique is that it 
provides highly detailed and reliable results. However, because of the 
labor involved in administering, responding to, and scoring pair com- 
parisons, the method is not frequently employed by teachers. It is de- 
scribed here because it is sometimes used by educational and psycho- 
logical specialists in research investigations of interests and attitudes. 
Consequently, it is good for teachers to know something about the 
method in order to understand research results. 


Measurement of Attitudes 


Attitudes are predispositions to react negatively or positively in 
some degree toward a class of objects, ideas, institutions, or people. A 
student is displaying an attitude toward Negroes when he says “I 
don’t want to be in an ‘integrated’ school.” A worker is displaying an 
attitude toward labor unions when he says “Our best friend is the 
union.” A high school student displays an attitude toward higher edu- 
cation when he says “College isn’t worth the effort.” Such feelings are 
important, and important to learn about, because, in large measure, 
they determine what people actually do. If students hold positive atti- 
tudes, e.g., toward higher education, it is relatively easy to lead them 




toward desirable goals. In contrast, if attitudes are negative, it is very 
difficult. In order to obtain the cooperation of students in moving 
toward socially approved goals, it is necessary to study their attitudes 
and, if they are bad, find ways to improve them. The three major 
methods for studying attitudes are (a) observation, (b) self-report, 
and (c) projective techniques. 

Observation. In some cases the attitudes of students are so obvious 
that no refined methods of measurement are needed. For example, if in 
a particular high school, many capable students make no effort to go 
to college, it is obvious that attitudes toward higher education are not 
as favorable as they should be. As another example, if all the students 
in a particular high school course grumble about the instructor, the 
negative attitudes toward the instructor are easy to see. 

The reason that observation is not always used to measure attitudes 
is that in many cases there is very little to observe. For example, be- 
fore students reach the point of making decisions about college, it 
would be very difficult to determine through observations attitudes 
toward higher education. Sometimes we are interested in studying at- 
titudes toward ideas or institutions for which there would be little if 
anything to directly observe. For example, it would not be possible to 
“observe” students’ attitudes toward the United Nations. Another
reason why it is not always possible to rely on observation is that we
are often concerned with fine distinctions among attitudes of a kind
that are too subtle to be manifested in overt behavior. For example,
even among students who elect to go to college, there still are very
important differences in attitudes toward higher education. For these
reasons, self-report and projective techniques are used as a
supplement to observation.

Self-report. All the self-report techniques described with respect
to the measurement of interests also can be used for the measurement
of attitudes. Most self-report forms for the measurement of attitudes
employ some version of the “absolute response” techniques rather than
one of the “relative response” techniques. Interests are properly
measured in terms of relative responses because primarily we want to
learn what activities interest students “more” and what activities
interest them “less.” Absolute response methods are preferable with
attitudes because it is necessary to learn how favorable students are,
in an absolute sense, toward the particular attitudinal object.

The simplest, and in many ways the best, method of measuring
attitudes through self-report is to present the student with a list of
statements embodying various degrees of positive and negative feelings.
He is asked to indicate his agreement or disagreement with each,
either as a dichotomous response or in the form of a rating scale. An
example is as follows:




6. SA: Strongly agree
5. A: Agree
4. AS: Agree slightly
3. DS: Disagree slightly
2. D: Disagree
1. SD: Strongly disagree

1. A college education is essential for       SD  D   DS  AS  A   SA
   any type of high-level job.                1   2   3   4   5   6

2. A college education makes you a            SD  D   DS  AS  A   SA
   broader, more world-wise person.           1   2   3   4   5   6

3. It is better to study on your own          SD  D   DS  AS  A   SA
   rather than go to college.                 1   2   3   4   5   6

4. People can do very well in life            SD  D   DS  AS  A   SA
   without going to college.                  1   2   3   4   5   6

5. College is only for snobs who want to      SD  D   DS  AS  A   SA
   act like they are better than other        1   2   3   4   5   6
   people.

6. College provides you with many new         SD  D   DS  AS  A   SA
   ideas and interests.                       1   2   3   4   5   6


In practice such a scale would probably contain at least ten statements. 
Numerous statements are needed for two reasons. First, to reduce the 
“chance” influence (measurement error), it is good to have quite a
number of items to “add over.” Second, it is necessary to have
numerous statements so that, in addition to learning about over-all
attitudes, it is possible to learn the particular ways in which
students react negatively and positively. It would, for example, be
important to learn whether a student's over-all favorable attitudes
toward college are due mainly to vocational aims or cultural
enhancement.

The simplest way to analyze the results of a scale such as that shown 
above is to add the numbers corresponding to the positions on the 
scale. Thus, if a student marks “strongly agree” to the first statement, 
this is added to the corresponding numbers for the marks made on 
other statements. For negative statements, the scale should be reversed 
before ratings are added. That is, “strongly agree” should be counted 1 
rather than 6, “agree” should be counted 2 rather than 5, and so on. 
After the negative statements are reversed in this way, all statements
can be added and then averaged. A sample result might be an average
rating of 5.2 for one student, showing that his over-all attitude toward
college is highly favorable.
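
The scoring rule described above can be sketched as follows. On a six-point scale, reversing a rating amounts to subtracting it from 7, so that 6 becomes 1, 5 becomes 2, and so on. Treating statements 3, 4, and 5 of the example as the negative ones is a reading of their wording, and the function name and the student's ratings are invented for illustration.

```python
SCALE_MAX = 6
# Assumption: statements 3, 4, and 5 in the example are worded against college.
NEGATIVE_ITEMS = {3, 4, 5}


def attitude_score(ratings, negative_items=NEGATIVE_ITEMS, scale_max=SCALE_MAX):
    """Average rating after reversing the negatively worded statements."""
    adjusted = [
        (scale_max + 1 - rating) if item in negative_items else rating
        for item, rating in ratings.items()
    ]
    return sum(adjusted) / len(adjusted)


# Invented ratings by one student: statement number -> position marked (1 to 6).
ratings = {1: 6, 2: 5, 3: 1, 4: 2, 5: 1, 6: 5}
print(attitude_score(ratings))
```

Here the student strongly disagrees with the negative statements, so after reversal every statement contributes a 5 or 6 and the average is high.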

The use of statements, such as those shown above, provides a tech- 
nique for the measurement of many different kinds of attitudes. By 
composing the proper kinds of statements, scales can be constructed to 
measure attitudes toward educational goals, institutions, government 
policy, religions, ethnic and racial groups, and many others. 

Another method of measuring attitudes is with the Semantic
Differential (60). With this method, the “thing” being studied, e.g.,
college education, is rated with respect to sets of bipolar adjectives.
An example is as follows:


                      College education

Worthless    ___:___:___:___:___:___:___   Valuable
Wise         ___:___:___:___:___:___:___   Foolish
Unpleasant   ___:___:___:___:___:___:___   Pleasant
Good         ___:___:___:___:___:___:___   Bad
Friendly     ___:___:___:___:___:___:___   Unfriendly


Each pair of bipolar adjectives is called a scale. The “thing” to be
rated is called the concept. The usual practice is to present each
concept on a separate page and list the scales immediately beneath the
concept. A typical study employs from ten to twenty scales and from
eight to about fifteen concepts.

Convenience is one of the great advantages of the Semantic
Differential. Rather than study only one attitudinal object, e.g.,
college education, it is possible to study a dozen or more attitudinal
objects at the same time. It is a simple matter to compose the scales
and have them reproduced. Subjects can rate as many as twenty concepts
on as many as twenty scales in less than an hour's time.

There are two important types of analyses to be made of Semantic 
Differential results. First, it is useful to obtain over-all “favorableness” 
averages for each student. For this, a mark in the space on the extreme 
left is counted as 1, the first space to the right 2, the next space to the 
right 3, and so on to 7 for the space on the extreme right. If the more 
positive adjective is on the left rather than the right, the scoring is re- 
versed, e.g., 1 is given to a mark on the extreme right. The scores are 
then added and averaged over all scales in which one adjective is ob- 
viously more “favorable” than the other member of the pair. The result 
is an over-all favorableness rating. These average results would be ex- 
pected to correlate highly with the average results obtained from 
making agree-disagree responses to statements like those shown pre- 
viously. 
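
The favorableness scoring just described can be sketched as follows, using the five scales from the college-education example. The function name, the flag recording which pole is favorable, and the student's marks are assumptions made for illustration.

```python
# (left adjective, right adjective, True when the favorable adjective is on the left)
SCALES = [
    ("Worthless", "Valuable", False),
    ("Wise", "Foolish", True),
    ("Unpleasant", "Pleasant", False),
    ("Good", "Bad", True),
    ("Friendly", "Unfriendly", True),
]


def favorableness(marks, scales=SCALES, steps=7):
    """Average score over scales; marks count spaces from the left (1 to steps)."""
    scores = [
        (steps + 1 - mark) if favorable_left else mark
        for mark, (_, _, favorable_left) in zip(marks, scales)
    ]
    return sum(scores) / len(scores)


# Invented marks by one student, one per scale, in the order listed above.
marks = [6, 2, 5, 3, 1]
print(favorableness(marks))
```

After reversal a 7 always stands for the most favorable response, whichever side the favorable adjective was printed on.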

A second important way to analyze Semantic Differential results in
addition to obtaining average “favorableness” ratings is by obtaining
profiles. This is done by finding the average rating given by a group of
students to each concept on each scale. The average results can then
be plotted to form a profile, or “picture,” of the results. In Figure 15-1
some illustrative results are shown from a study by the author (58).
Two hundred members of the general public were asked to rate concepts
concerning mental illness and, for purposes of comparison, concepts
relating to normal people. The figure shows the profiles of average
ratings of three concepts: Neurotic man, Old man, and Me
(self-rating). Comparisons of the profiles show many interesting
differences in attitudes toward the three concepts.

[Profile plot. Scales, left pole to right pole: Bad/Good,
Worthless/Valuable, Dirty/Clean, Insincere/Sincere, Foolish/Wise,
Ignorant/Intelligent, Dangerous/Safe, Sad/Happy, Poor/Rich,
Unpredictable/Predictable, Tense/Relaxed, Sick/Healthy, Weak/Strong,
Delicate/Rugged, Passive/Active, Slow/Fast, Cold/Warm.]

Figure 15-1. Semantic Differential profiles for the concepts Me (—), Old man (- - -),
and Neurotic man (+++). Each point represents the mean rating of 200 subjects.
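
The profile calculation, the average rating given by a group to each concept on each scale, can be sketched as follows. The function name, the two raters, and their ratings are invented for illustration and are not the data plotted in Figure 15-1.

```python
def concept_profiles(ratings):
    """Mean rating of each concept on each scale.

    ratings maps a concept to a list of rating sheets, one per rater;
    each sheet maps a scale name to the position marked (1 to 7).
    """
    profiles = {}
    for concept, sheets in ratings.items():
        scales = sheets[0].keys()
        profiles[concept] = {
            scale: sum(sheet[scale] for sheet in sheets) / len(sheets)
            for scale in scales
        }
    return profiles


# Invented ratings by two raters on two scales.
ratings = {
    "Me": [
        {"Bad-Good": 6, "Weak-Strong": 5},
        {"Bad-Good": 7, "Weak-Strong": 5},
    ],
    "Neurotic man": [
        {"Bad-Good": 4, "Weak-Strong": 2},
        {"Bad-Good": 3, "Weak-Strong": 3},
    ],
}

profiles = concept_profiles(ratings)
for concept, profile in profiles.items():
    print(concept, profile)
```

Plotting each concept's averages scale by scale and connecting the points would give profiles of the kind shown in the figure.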

A third important type of attitude scale is that developed by 
Bogardus (12). The scale is useful only for studying attitudes toward 
various national, racial, and ethnic groups. The scale items concern 
the desired social distance from the members of a particular group. 
A typical scale is as follows: 


Japanese
1. To close kinship by marriage
2. To my club as personal friends
3. To my street as neighbors
4. To employment in my occupation
5. To citizenship in my country
6. As visitors only in my country
7. Would exclude from my country




The student marks those items which correspond to his desired “near- 
ness” to the type of people, Japanese, in the example above. The 
social-distance scale should be useful to teachers in learning the feel- 
ings of students toward the different national, racial, and ethnic 
groups that they read about and discuss in class. 

Projective Techniques. The major use of projective techniques is 
for the measurement of personality characteristics, and, consequently, 
the topic will be discussed more fully in the next chapter. However, 
because projective techniques also are sometimes used to measure 
attitudes, brief mention will be made of how the techniques are used 
for that purpose. 

A projective technique essentially consists in a relatively unstruc- 
tured situation to which students can respond in a variety of ways. 
The way in which a student responds often is indicative of his atti- 
tudes toward the people, institutions, and ideas involved in the situa- 
tion. Projective techniques for the measurement of attitudes usually 
employ either pictures to be interpreted or stories to be completed. 
Examples of both of these are given as follows: 

Unstructured pictures could be used to measure attitudes toward
Negroes. A picture shows a white man and a Negro standing on a
street corner. The white man apparently is frowning as he says
something to the Negro. The student is shown the picture and asked to tell
what is going on. The responses that students give often are quite
indicative of their attitudes toward Negroes. Following is the type of
response that would indicate a very negative attitude toward Negroes:


The white man says, “You stole my money, and I want it back.” The
Negro says, “I ain't got your money, and I will kill you if you say that I
have.” He has a razor in his pocket.

A much more favorable attitude toward Negroes would be indicated
by the following response:


The white man says, “I wish these darn busses would come on time.”
The Negro man says, “You must have been waiting a long time. My wife
will be here in a few minutes, and I will give you a ride home.”


Instead of using pictures, attitudes toward Negroes could be measured
with the use of incomplete stories. For example, students could be
asked to complete the following:


A white man and a Negro man are standing on the corner of a busy
street. The white man turns, frowns, and says . . . .


A similar approach could be used to measure attitudes toward teachers.
An example is:




The teacher sees Johnny take a dime from Susan’s desk. She knows that
the money does not belong to him. At lunch time, the teacher asks
Johnny to wait until the class is gone. Then she says to Johnny . . . .


A negative attitude toward teachers would be exemplified by a re- 
sponse like: 


“You are a dirty little thief. I am going to call the police, and they will 
fix you for this. I am going to tell all the children what you are.” 


A more positive attitude toward teachers would be exemplified by: 


“You must have wanted the dime very badly to take it from Susan’s 
desk. Is there some reason why you need the money? Tell me, and maybe 
I can help.” 


Several cautions should be heeded in using projective techniques to 
measure attitudes. The interpretation of projective techniques is more
of an art than a science. There are no purely objective ways to score 
the results of most projective techniques. Rather, the “measure” de- 
pends very much on the intuition of the examiner. 

When used by experts, projective techniques often are able to uncover
attitudes that could not be detected in other ways. However, doing
this requires much study and practice. Few teachers have
the skills needed to make fine interpretations of projective techniques,
and, consequently, they should be very cautious in using them.
However, in a sense, teachers cannot “avoid” using projective techniques,
because most reports and themes are partly that. For example, if in
many of the themes written by a particular student, “foreigners” are
pictured as dirty, mean, and dishonest, the attitude is clear. Whether
the teacher purposefully uses projective techniques or only picks up
projected attitudes in themes, definite interpretations should be made
only if the evidence is very strong, and only if it occurs repeatedly.
Fine distinctions among attitudes and “deep” interpretations of
attitudes with projective techniques should be made only by qualified
psychologists and educational specialists.

Attitudes in Class. One of the principal uses by teachers of attitude
scales is to measure what students feel about a particular unit of
instruction: the text, the subject matter, and the teacher. Although
students are not infallible judges of the quality of particular units of
instruction, they often provide the only source of evaluation. It is
difficult for school administrators and fellow teachers to evaluate how
well a teacher is conducting his class. There are some bits of reliable
information to help in making such evaluations, but too often
impressions are formed on the basis of unreliable hearsay or a chance
occurrence. Besides the teacher, only students actually “live through”
the unit of instruction, and only they have a realistic basis for judging
the quality of the instruction.

Apart from the difficulties that some have with clearly wording 
items, teachers should have little trouble in constructing scales to 
measure attitudes of students toward units of instruction. A sample set 
of instructions and a partial list of items are as follows: 


You are to rate your feelings about this course, including your feelings 
about the teacher, the text, the exercises, and the topic. The results will 
be helpful to your teacher only if you are as frank as possible. Give praise 
where you think it is deserved, and do not hesitate to be critical when 
you think something could be improved. With the following scale you 
are to rate this course in comparison to other courses you have had. 


5 Exceptionally good
4 Above average
3 Average
2 Below average
1 Exceptionally bad


If you feel that a particular aspect of the course is “below average,” 
place 2 in the space to the right. If you feel that another aspect of the 
course is “exceptionally good,” place 5 in the space to the right. In this 
way, write in the number corresponding to each item. 


1. The teacher's knowledge of the topic ____
2. Interestingness of lectures ____
3. Teacher's attention to problems of particular students ____
4. Teacher's encouragement of students to think on their own ____
5. Interestingness of the textbook ____
6. Clearness of the textbook ____
7. Subject matter coverage of the textbook ____
8. Interestingness of the topic ____
9. Importance of the topic for your over-all education ____
10. Content of tests ____
11. Fairness of grading ____
12. Over-all evaluation of your experience in this course ____


At the end of the list of items space should be provided for students 
to give additional comments. The simplest way to analyze the attitude 
ratings is to find the average rating by the class for each item. 
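
The averaging step can be sketched in a few lines of Python. This is
only an illustration: the item labels are shortened, the ratings are
invented, and `item_means` is a hypothetical helper name.

```python
# Each inner list is one student's anonymous form: ratings from
# 1 (exceptionally bad) to 5 (exceptionally good), in item order.
# The item labels are abbreviated and the numbers are invented.
ITEMS = ["Teacher's knowledge", "Interestingness of lectures", "Fairness of grading"]
forms = [
    [5, 4, 3],
    [4, 4, 2],
    [5, 3, 4],
    [4, 5, 3],
]

def item_means(forms, items):
    """Return {item: class average rating}, rounded to two decimals."""
    columns = zip(*forms)  # regroup the ratings item by item
    return {item: round(sum(col) / len(col), 2) for item, col in zip(items, columns)}

class_means = item_means(forms, ITEMS)
for item, avg in class_means.items():
    print(f"{item}: {avg}")
```

Because the forms carry no names, nothing in the calculation depends on
which student made which rating.
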

Several principles must be heeded in order to validly construct,
administer, and interpret student ratings of instruction. In order to
obtain honest answers from students, frankness must be emphasized. It
is difficult for teachers to be coldly objective about their own
instruction, and it is difficult for them not to show this concern to their
students. The responses will actually represent students’ true feelings
only if the attitude scale is administered in a matter-of-fact manner.

In order to obtain cooperation from students, anonymity is essential. 
It is too much to expect that students will criticize their teacher if 
their names are clearly visible on the inventory. Not only is it essential 
that no names or identifying marks be placed on the inventory, but it 
is much better if the forms are marked and stacked on a desk while 
the teacher is out of the room. 

The first time that a teacher applies a scale of attitudes toward in- 
struction, he quite likely will be pleasantly surprised. Although it is 
sometimes hard to believe, students, as a group, tend to be overly 
kindhearted toward their teachers. At least they tend to be in making 
attitude ratings of instruction. Students, as a group, tend to make more 
“above average” than “below average” ratings of all their teachers and 
all their courses. Because of this “error of leniency” it is necessary to 
make certain comparative interpretations of results. Comparisons can 
be made by the teacher of ratings given in previous classes and/or in 
courses relating to different topics. If large differences are found
between these different sets of ratings, they provide hints about the
quality of instruction.
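
A comparative reading of this kind might be sketched as follows. The
ratings are invented, and the cutoff for a "large" difference is a
hypothetical choice, since no particular value is given here.

```python
# Average ratings per item on the 1-5 scale. All numbers are invented.
previous_classes = {"Lectures": 4.1, "Textbook": 3.9, "Tests": 4.0}
current_class = {"Lectures": 4.2, "Textbook": 3.1, "Tests": 4.0}

# Hypothetical cutoff for what counts as a "large" difference.
LARGE_DIFFERENCE = 0.5

def compare_ratings(current, baseline, cutoff=LARGE_DIFFERENCE):
    """Return {item: difference} for items whose mean moved by at least cutoff."""
    flagged = {}
    for item, value in current.items():
        diff = round(value - baseline[item], 2)
        if abs(diff) >= cutoff:
            flagged[item] = diff
    return flagged

flagged = compare_ratings(current_class, previous_classes)
print(flagged)
```

In this made-up case the textbook rating of 3.1 is still above the
midpoint of the scale, yet it stands well below the teacher's earlier
classes, which is the kind of hint that a comparison brings out and a
single set of lenient ratings hides.
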

Some schools either require, or suggest, that all teachers regularly
apply attitude scales relating to instruction (which tends to
traumatize some teachers). The ratings can be collected and analyzed by
some “neutral” party, someone not directly connected with the school
administration. This way, norms can be obtained for the school as a
whole and for different types of classes. After the results are analyzed,
the teacher's ratings are returned, and by this method, his anonymity
is protected. This gives the teacher an opportunity both to see the
ratings made by his own students and to compare them with norms
for the school, or school system, as a whole. Although it sometimes
hurts our vanity a bit to do it, attitude ratings by students provide
extremely useful information about how to improve our instruction.


Summary


Teachers necessarily are very much concerned with the attitudes and 
interests of their students, and, consequently, it is important to know 
something about related methods of measurement. In schools two 
types of interests are important: (a) interests in activities relating to 
vocational pursuits, and (b) interests in activities within the class- 
room. The former is important in the vocational counseling of high 
school students. The latter is important in helping teachers to struc- 
ture classroom exercises and projects. 

Interests are measured almost exclusively with self-report
techniques, for which a number of methods are available. The major reason
why measures of vocational interests are needed is that students
seldom are aware of the specific skills and activities which different
professions require. The available evidence indicates that interest
inventories work well as aids to vocational counseling. Teachers can easily
construct their own self-report instruments for measuring interests in
classroom activities.

Attitudes concern feelings about different types of people, ethics, 
and ways of life. Because much of education is intended to promote 
“healthy” attitudes, it is important to know how to measure attitudes. 
Although self-report techniques most frequently are used to measure 
attitudes, projective techniques and observational methods also can 
be employed. By employing these techniques, teachers can learn
something about the impact of school and home on students. Also, in reading
about educational research, teachers will see frequent mention made 
of studies of students’ attitudes. 

One important use of attitude measurement is in obtaining reactions 
of students to particular units of instruction. Teachers can easily con- 
struct self-report forms for that purpose. Students (at least those at 
the high school level) tend to be fairly good judges of the caliber of 
instruction, texts, exercises, examinations, and other aspects of their 
school; and attitudes of students toward instruction provide teachers 
with many helpful hints about how to most effectively conduct their
classes.


Chapter 16

Measurement of Personality


Mr. Martin tells a fellow teacher, “Jimmy Bartox has plenty of ability, 
but he makes low grades because of his disturbed personality.” The 
principal is considering shifting Mrs. Blum from the first to the seventh 
grade because he doubts that she has the personality required to deal 
with young children. Jack Madden is the best-liked student in the
twelfth grade, because, as everyone says, he has such a good
personality. In these and many other ways personality attributes relate
importantly to the joys and pains of everyday events in school. It need
not be argued that educational decisions would be considerably im- 
proved if adequate measures of personality attributes were available. 

One of the impediments to discussing the measurement of personality
is that there are so many different kinds of attributes that are
included under the name. Some people even include interests and
attitudes in this category, which we chose to discuss separately. One
use of the word is in saying that a person has a lot of personality,
meaning that he possesses all the social graces. Personality is often
defined negatively as the nonability or noncognitive functions. Any test
that is not clearly a measure of aptitude or achievement is called a
personality test. One often hears elegantly vague definitions such as
“personality is the total functioning individual interacting with his
environment.” Retreating from this last definition, let us come down to
earth and talk about some of the different kinds of attributes that
properly can be called personality and are, at least potentially,
susceptible to measurement.


1. Character. Character concerns the extent to which students adhere
to widely accepted standards of ethical and moral behavior. At
one extreme are juvenile delinquents, and at the other extreme are
students who are admired for their personal integrity. Some traits
which are involved in character are honesty, sportsmanship,
politeness, considerateness of others, and respect for widely held social
values.

2. Social traits. This category concerns the characteristic behavior
of an individual with respect to other people, excluding those traits
that would more properly be subsumed under character. Typical social
traits are shyness, moodiness, humor, talkativeness, and dominance.
This is the category of attributes that is most frequently called
personality. Of course social traits are not entirely independent of
character. However, they are far from perfectly related. For example, a
dishonest student might be either shy or aggressive, hyperactive or
placid, or friendly or hostile.

3. Situational adjustment. This category refers to the realistically 
good and bad aspects of a student’s home and school environment. It 
should be understood that some people are maladjusted, at least tem- 
porarily, by unsavory aspects of their current environments rather than 
because of more enduring aspects of their “permanent” personalities. 
For example, if a student comes from a home which is fraught with 
hostility and discord, he may appear quite unhappy and maladjusted. 
As another example, if a student is receiving the brunt of a teacher's 
own frustrations and embitterment, it is only to be expected that the 
student will be maladjusted in that situation. Some other factors which
relate to situational adjustment are poverty, death in the family,
frequent moves, crowded schools, unfit teachers, and physical disability.

4. Mental illness. If a student is extremely disturbed personally 
and/or if he is extremely disturbing to others, he is said to be mentally 
ill. For example, within limits shyness is a normal social trait, but 
when it gets to the point that a student will hide in a closet for fear of 
others, it is said to be mental illness. Mental illness should not be 
taken to subsume some mysterious “disease in the mind.” At the pres- 
ent time we are not sure how mental illness gets started or what keeps 
it going. For the discussion in this chapter, it will be sufficient to think 
of mental illness as referring only to behaviors which are so extreme that 
everyone feels that “something needs to be done about the student.” 


Most of the instruments to be discussed in this chapter are not fre- 
quently employed by classroom teachers; rather, they often are used 
by school psychologists in studying problem students, and they are 
used by specialists in research on the educational process. The three 
major methods which are currently being used to measure personality 
are (a) self-report, (b) projective techniques, and (c) observational
methods. Each of these will be discussed in turn. 


Self-report Methods 


Self-report is one of the basic tools in the diagnosis of illness. The
physician asks, “Where does it hurt?” “How is your appetite?” The
patient volunteers information like, “I don’t sleep so well,” and “I am
all out of pep.” The self-report technique has carried over into
psychiatry and clinical psychology. The questions are different, but the
method is the same. The psychiatrist asks the patient whether he has
nightmares or feels uncomfortable in a crowd; he inquires how the
patient gets along with parents and friends. Over the years it has
become apparent that certain questions are more successful than others
in detecting maladjustment. These have become almost standard for
all interviews, used along with questions that have particular
relevance to a case. It is an easy jump from a list of standard questions to
a test: all that is necessary is to write the questions down and have
the subject indicate his agreement or disagreement with each. This is
exactly how the first personality inventories were developed.

The first self-description inventory that achieved prominence was
Woodworth’s Personal Data Sheet (86). The inventory was developed
as a means of weeding out emotionally unstable persons from the
United States Army. A standard, time-saving procedure was needed
because of the shortage of trained interviewers. The inventory contains
116 questions concerning neurotic tendencies, ten of which are as
follows:


Are you troubled with dreams about your work?
Do you often have the feeling of suffocating?
Have you ever had fits of dizziness?
Did you ever have convulsions?
Did you have a happy childhood?
Have you ever seen a vision?
Did you ever have a strong desire to commit suicide?
Can you stand the sight of blood?
Are you troubled by the idea that people are watching you on the street?
Does it make you uneasy to sit in a small room with the door shut?


The questions were obtained from a search of the psychiatric
literature and from conferences with psychiatrists. A neurotic-tendency
score is obtained for each person by adding the number of “neurotic”
responses.

The personal data sheet was not considered to be a test in the strict
sense of the word. Persons who gave more than thirty or forty
“neurotic” responses were brought in for detailed psychiatric
interviews. Although little direct evidence for validity was obtained,
people who worked with the personal data sheet during the First World
War were generally satisfied with the inventory as an aid to
psychiatric screening. After the First World War an interest developed in
the construction of tests of all kinds, personality inventories included.
Most of the inventories were modeled directly after the personal data
sheet, to the extent of using many of the same items.
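
The scoring rule described for the personal data sheet (count the
“neurotic” responses, then refer high scorers for an interview) can be
sketched as follows. The answer key shown is an assumption made for
illustration, as is the choice of thirty as the cutoff; only three of
the 116 questions appear.

```python
# Hypothetical answer key: for each question, the answer counted as "neurotic".
# Only three of the 116 questions are shown, and the keying is assumed.
NEUROTIC_KEY = {
    "Do you often have the feeling of suffocating?": "yes",
    "Did you have a happy childhood?": "no",
    "Can you stand the sight of blood?": "no",
}

# The text says "more than thirty or forty"; thirty is assumed here.
REFERRAL_CUTOFF = 30

def neurotic_tendency_score(answers, key=NEUROTIC_KEY):
    """Count the answers that match the keyed 'neurotic' direction."""
    return sum(1 for question, keyed in key.items() if answers.get(question) == keyed)

def needs_interview(score, cutoff=REFERRAL_CUTOFF):
    """True when the score exceeds the cutoff for a detailed interview."""
    return score > cutoff

answers = {
    "Do you often have the feeling of suffocating?": "yes",
    "Did you have a happy childhood?": "yes",
    "Can you stand the sight of blood?": "no",
}
score = neurotic_tendency_score(answers)
print(score, needs_interview(score))
```

The score is used only for screening: it decides who is interviewed,
not what the interview concludes.
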

Problem Checklists. A widely used type of personality inventory 
for elementary and high school students is the problem checklist. 
Such inventories present long lists of typical problems at school, in the 
home, and among friends. Students either mark those problems that 
apply to them or they rate the extent to which each is a problem. A 
typical checklist is the SRA Junior Inventory (63) which samples 
problems in five areas designated as about me and my school, about 
me and my home, about myself, getting along with other people, and 
things in general. Some items from the inventory are: 


I want to learn how to read better. 

I wish I had more “pep.” 

I wish my parents would not be so strict. 
I am too nervous. 


For each item students rate whether it represents a “big,” “middle- 
sized,” “little,” or “no” problem. The inventory is standardized for 
children in grades 4 to 8. A companion checklist (64) is available for 
students in grades 7 to 12. 

The SRA Junior Inventory, and most other checklists, provide total 
area scores, e.g., for the home. However, rather than depend heavily 
on such total scores, it is wiser to look at the particular problems 
which each student checks. These will provide many hints about the 
problem areas for each student. 
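
Both readings of a checklist, the area totals and the individual
problems, can be sketched as follows. The numeric weights for the four
ratings and the assignment of the sample items to areas are assumptions
made for illustration; they are not the published scoring.

```python
from collections import defaultdict

# Assumed weights for the four ratings (not the published scoring).
WEIGHTS = {"no": 0, "little": 1, "middle-sized": 2, "big": 3}

# (area, item, rating) for one hypothetical student. The area assigned
# to each item is a guess made for this example.
responses = [
    ("about me and my school", "I want to learn how to read better.", "little"),
    ("about myself", "I wish I had more pep.", "no"),
    ("about me and my home", "I wish my parents would not be so strict.", "big"),
    ("about myself", "I am too nervous.", "middle-sized"),
]

def area_scores(responses, weights=WEIGHTS):
    """Sum the weighted ratings within each problem area."""
    totals = defaultdict(int)
    for area, _item, rating in responses:
        totals[area] += weights[rating]
    return dict(totals)

def big_problems(responses):
    """List the particular items that the student rated as a big problem."""
    return [item for _area, item, rating in responses if rating == "big"]

totals = area_scores(responses)
worst = big_problems(responses)
print(totals)
print(worst)
```

The second function reflects the advice above: the list of particular
problems usually tells a counselor more than the area totals do.
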

A short case history will help to show how the results of a problem 
checklist provided valuable information about a seventh-grade stu- 
dent. A guidance counselor administers one of the problem checklists 
to all students as part of a school-wide testing program. Wade Martin 
checks many more problems than does the average student. In study- 
ing his responses, the guidance counselor sees that Wade has indi- 
cated many problems in the home and particularly with respect to his 
parents. The guidance counselor talks with Wade to see if he actually 
has a burden of personal problems. Wade seems depressed, and he has 
many unkind things to say about his “new mother.” The guidance 
counselor has a talk with Wade's parents, which confirms the prob- 
lems indicated on the checklist. Two years after the death of Wade's 
mother, the father remarried, and the new wife is much younger and 
different in many other ways from Wade's mother. The parents say 
that Wade always has been a sensitive boy and that the remarriage 
had a very depressing effect on him. The parents are eager to do
whatever they can to help, but they have had very little success in
dealing with Wade. The guidance counselor suggests that a nearby
psychological clinic be contacted, where counseling can be obtained
for both Wade and the parents.

Minnesota Multiphasic Personality Inventory (MMPI). A self- 
description inventory which frequently is employed by guidance 
counselors and school psychologists in dealing with problem students 
at the high school level is the Minnesota Multiphasic Personality In- 
ventory (41). The MMPI represents the apex of research and detailed 
test construction in the area of adjustment inventories. Research on 
the instrument has gone on for twenty years now, and hundreds of 
journal articles have been devoted to its construction, refinement, and 
use. The MMPI is intended to measure the relative presence or ab- 
sence of eight forms of mental illness, five of which are listed below. 
Two related items are shown for each type of mental illness. A plus 
sign means that persons who have the illness are likely to agree with 
the item; a negative sign means that they are likely to disagree. 

Hypochondriasis (Hs). Overconcern with body functions and
imagined illness.

Related items:
I do not tire quickly. (-)
The top of my head sometimes feels tender. (+)


Depression (D). This is used in the conventional sense to imply
strong feelings of blueness, despondence, and worthlessness.

Related items:
I am easily awakened by noise. (+)
I sometimes keep on at a thing until others lose their patience with
me. (+)


Hysteria (Hy). The development of physical disorders such as
blindness, paralysis, and vomiting as an escape from emotional
problems.

Related items:
I am likely not to speak to people until they speak to me. (+)
I get mad easily and then get over it soon. (+)


Psychopathic deviate (Pd). An individual who lacks “conscience,”
who has little regard for the feelings of others, and who gets into
trouble frequently.

Related items:
My family does not like the work I have chosen. (+)
What others think of me does not bother me. (+)


Paranoia (Pa). Extreme suspiciousness to the point of imagining
elaborate plots.

Related items:
I am sure I am being talked about. (+)
Someone has control over my mind. (+)
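
The plus and minus keying shown above can be sketched as a simple
counting rule. This is only an illustration of keyed scoring using the
two Hypochondriasis items quoted in the text; the actual scales contain
many more items, and `raw_scale_score` is a hypothetical helper, not the
published scoring procedure.

```python
# Keyed direction for the two Hypochondriasis items quoted in the text:
# "+" means agreement counts toward the scale, "-" means disagreement does.
HS_KEY = {
    "I do not tire quickly.": "-",
    "The top of my head sometimes feels tender.": "+",
}

def raw_scale_score(responses, key):
    """Count keyed responses; responses maps item text to True (agree) or False."""
    score = 0
    for item, direction in key.items():
        agreed = responses.get(item)
        if agreed is None:
            continue  # unanswered items are simply skipped in this sketch
        if (direction == "+" and agreed) or (direction == "-" and not agreed):
            score += 1
    return score

keyed_answers = {
    "I do not tire quickly.": False,
    "The top of my head sometimes feels tender.": True,
}
hs_raw = raw_scale_score(keyed_answers, HS_KEY)
print(hs_raw)
```
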


The results of the MMPI can be plotted as a profile showing scores
on the scales¹ (see Figure 16-1 for illustrative profiles). It is seldom
found that a person scores high (the maladjusted direction) on only
one of the scales. Typically, the maladjustment spreads across several
of the scales. Some of the scales correlate substantially with the others,
which is to be expected because mental illness seldom occurs as one
specific pattern of traits. It is usually the case that the patient has a
mixture of different kinds of mental illness.

[Figure 16-1: profile chart. Horizontal axis: Scale (Hs, D, Hy, Pd, Mf,
Pa, Pt, Sc, Ma). Vertical axis: T score.]

Figure 16-1. MMPI profiles for a normal adult (---) and for a “typical
psychotic” (—). (Adapted from Gough, 34, pp. 554 and 563; reproduced by
permission of The Ronald Press Company.)

1 The other three scales for mental illness are Psychasthenia (Pt, strong fears),
Schizophrenia (Sc, bizarre thoughts), and Hypomania (Ma, overactivity). A
ninth scale is included to measure Masculinity-Femininity (Mf), the balance of
male versus female interests.
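
Figure 16-1 plots T scores rather than raw scores. Assuming the
conventional definition of a T score (a standard score with a mean of
50 and a standard deviation of 10 in the norm group), the conversion
can be sketched as follows. The norm values used are invented, and
published norms for a particular inventory may involve further
adjustments not shown here.

```python
def t_score(raw, norm_mean, norm_sd):
    """Convert a raw score to a T score (mean 50, standard deviation 10)."""
    if norm_sd <= 0:
        raise ValueError("norm_sd must be positive")
    return 50 + 10 * (raw - norm_mean) / norm_sd

# Hypothetical norms for one scale: mean 12, standard deviation 4.
average_student = t_score(12, 12, 4)
high_scorer = t_score(20, 12, 4)
print(average_student, high_scorer)
```

A raw score two standard deviations above the norm mean becomes a T
score of 70, which is why all scales can share one vertical axis even
though they contain different numbers of items.
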




In order to interpret the MMPI profile, complex pattern scoring 
methods have been devised. Even with these it is necessary to have an 
experienced clinical psychologist to interpret the results. As is true of 
many clinical methods, a complex lore has developed about the mean- 
ing of different kinds of MMPI profiles, some of which has only slight 
grounding in empirical fact. 

The MMPI proved useful in diagnosing the problems of Ellen Cart- 
wright. Although she had always been thought of as a problem stu- 
dent, in the eleventh grade her behavior became so deviant as to re- 
quire immediate attention. She broke every rule in school, dressed 
oddly, and had many arguments with other students. The last straw 
was when she openly lit a cigarette in English class. The MMPI 
showed an extremely sick pattern, with particularly high scores on the 
psychopathic-deviate and paranoia scales. The school psychologist 
urged the parents to consult a psychiatrist. Ellen missed a year from 
school while she was under treatment. After returning to school, her 
behavior grew worse, and again she was removed from school. After 
another year at home, her behavior became so bizarre that it was 
necessary to place her in a mental hospital. 

Multifactor Batteries. An extensive effort has been under way to 
chart the major factors arising from self-description inventories. The 
hope is to find a limited number of traits that account for the scores 
obtained from diverse inventories. A series of studies by Guilford and
his associates laid the groundwork in this field. He first collected thirty- 
five statements which were purported by different authorities to be 
primary aspects of introversion-extraversion. These were made into an 
inventory and administered to groups of subjects. A factor analysis of 
the items produced five factors instead of the single continuum along 
which introversion-extraversion is commonly judged. Successive factor 
analyses were performed on new sets of items until thirteen factors 
were found in all. The ten most prominent factors appear in the 
Guilford-Zimmerman Temperament Survey (39). The factor names 
and descriptions are as follows: 


G, General activity. High energy, quickness of action, liking for
speed, and efficiency

R, Restraint. Deliberate, serious-minded, persistent

A, Ascendance. Leadership, initiative, persuasiveness

S, Sociability. Having many friends and liking social activities

E, Emotional stability. Composure, cheerfulness, evenness of moods

O, Objectivity. Freedom from suspiciousness, from hypersensitivity,
and from getting into trouble

F, Friendliness. Respect for others, acceptance of domination,
toleration of hostility

T, Thoughtfulness. Reflective, meditative, observing of self and
others

P, Personal relations. Tolerance of people, faith in social institutions,
freedom from faultfinding and from self-pity

M, Masculinity. Interest in masculine activities, hard-boiled, not
easily disgusted, versus (for femininity) romantic and emotionally
expressive


Multifactor batteries for the measurement of personality character- 
istics may eventually prove useful in school counseling programs and 
other practical aspects of school management. Presently such batteries 
are used almost exclusively in basic research, for example, in research 
on the relationships between personality characteristics of students 
and the effectiveness of different methods of instruction. 

Evaluation of Self-report Inventories. It is not necessary to be an 
expert in educational measurement to see that self-report inventories 
are not refined and exact tools of measurement. Some of the major 
reasons why are described as follows: 


1. Reliability. Self-report inventories are usually not as reliable as
most tests of achievement and aptitude. Most of the commercially
distributed tests of achievement and aptitude have reliability coefficients
of .90 or higher. Very few of the self-report inventories reach this high
level of reliability, and many of them dip below .80. This means that
the scores of students might change considerably from day to day,
which makes the interpretation of results somewhat difficult. The
reliabilities of self-report inventories are seldom so low as to render them
useless, but it usually can be expected that “chance” will play a larger
part in them than it typically does in tests of aptitude and
achievement.
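
One way to see what a drop in reliability means for an individual score
is the standard error of measurement from classical test theory, the
score standard deviation multiplied by the square root of one minus the
reliability. That formula is brought in here as an aid and is not part
of the discussion above; the standard deviation of 10 is an arbitrary
choice for the example.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """Classical formula: SEM = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1 - reliability)

# Arbitrary score scale with a standard deviation of 10 points.
for r in (0.90, 0.80, 0.70):
    sem = standard_error_of_measurement(10, r)
    print(f"reliability {r:.2f}: SEM = {sem:.2f}")
```

On this scale the expected chance fluctuation in a score grows from
roughly three points at a reliability of .90 to roughly four and a half
points at .80.
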

2. Validity. It is extremely difficult to determine the validity of 
most self-report inventories. Whereas achievement tests and aptitude 
tests can be straightforwardly, if laboriously, validated as assessments 
and predictors, respectively, it requires a complex type of construct 
validation to determine the worth of most self-report inventories. (At 
this point it might be helpful to briefly review the three types of 
validity discussed in Chapter 2.) For example, in order to determine 
the construct validity of the “Friendliness” factor on the Guilford- 
Zimmerman Temperament Survey, it would be necessary to show cor- 
relations with many other indices of friendliness, e.g., ratings by 
friends, observations of behavior, and many others. Construct
validation is so complex, and requires so much time and energy, that very
little of it has been done.

For lack of empirical evidence, most self-report inventories have
had to rely on face validity, which means, essentially, “Look at the
items and see if you don’t think that they measure what I say.” In
some instances this is not an unreasonable procedure. For example, if 
on a problem checklist, a student marks many problems directly con- 
cerning adjustment at home, it is reasonable to surmise that there ac- 
tually is something wrong in the home environment (if the student is 
reporting accurately). On many of the self-report inventories, the 
item content is so obliquely related to the supposed traits involved 
that it is not sensible to appeal to face validity. Also, face validity itself 
is a rather weak standard of validity. There are hundreds of examples 
of tests that “look” like they measure a particular trait, but when the 
evidence is in, it is found that either they measure nothing of conse- 
quence or something different from what was intended. 

3. Language difficulties. The validity of self-description inventories 
depends to a considerable extent on the clarity with which items are 
phrased. Consider, for example, a typical item such as “Do you usually
lead the discussion in group situations?” Respondents must interpret
what is meant by “usually”: 60 per cent of the time, 75 per
cent of the time, or 90 per cent of the time. Does the word “lead”
mean to talk the most, make the most important points, or have the 
final say? Does the phrase “group situation” pertain only to formal 
groups such as club meetings, or does it include casual discussions 
among friends? This may be overdoing the difficulties of communicat- 
ing social traits, but it illustrates the need for language clarity in the 
phrasing of self-description items.

In addition to the difficulty of wording items clearly, test constructors 
must be careful in describing the traits measured by inventories. This 
is true both of the inventories derived by factor analysis and those 
which are constructed on a “rational” basis. Some of the trait names 
appearing on current inventories are esoteric and confusing, such as 
“rhathymia” and “adventurous cyclothymia.” A school psychologist who
uses self-description inventories must know the meaning of the traits
being measured before the results can be put to any valid use. The
problem is not confined to self-description inventories. Aptitude factors
like verbal comprehension and perceptual speed might be misunder- 
stood by the test user, but there is more of a problem in communicating 
the meaning of personality factors. 

A serious consequence of the difficulty of naming and explaining 
personality traits such as those found in the multifactor inventories is 
that different inventories which purport to measure the same trait may 
have little in common. Correlations are sometimes very low between 
different inventories used to measure a trait such as introversion. Con- 
sequently, whether or not a person is said to be introverted depends 
on the inventory which is used. 

4. The acceptability influence. More important than the other roadblocks
to the construction of self-description inventories is the fact
that the respondent can and usually does control the responses for his
own ends. There is a strong drive in all of us to appear socially
acceptable and acceptable to ourselves. Acceptability in our society
means being intelligent, courageous, courteous, kind, dominant, and so
on. It is extremely difficult for an individual to admit to himself that
he is ignorant, cowardly, rude, mean, and submissive. It is even more
difficult for an individual to admit these failings publicly. People in
general tend to describe themselves in rosy terms, to an extent that
lowers the diagnostic value of self-description inventories.


A study by Edwards (24) demonstrates the extent to which the ac- 
ceptability influence dominates the responses made to self-description 
inventories. In the first part of his study, 152 subjects rated the social 
acceptability of each of 140 personality trait items on a nine-point 
scale. Scale values for the items were determined, showing the rela- 
tive negative or positive social value attributed to the traits. Next, 
Edwards made the 140 items into a personality inventory. A group of 
students was asked to indicate “yes” for each item that characterized
them and “no” for items that did not characterize them. The proportion
of persons answering “yes” to each item was determined. Edwards
found a correlation of .87 between the judged social acceptability of
items and the proportion of people who endorsed the items. This is
strong evidence that people generally try to describe themselves in a
socially acceptable manner on personality inventories.
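
The kind of computation behind a figure such as the .87 reported above can be
sketched as follows. This is only an illustration: the six items, their scale
values, and their endorsement proportions are invented for the example and are
not Edwards's data, and the function name pearson_r is an assumption of this
sketch.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: judged social acceptability of six items on a
# nine-point scale, and the proportion of respondents answering "yes".
scale_values = [1.5, 2.8, 4.0, 5.5, 7.2, 8.6]
proportion_yes = [0.10, 0.22, 0.35, 0.58, 0.80, 0.91]

r = pearson_r(scale_values, proportion_yes)
print(round(r, 2))
```

A high value of r in such a table would mean that the more acceptable an item
is judged to be, the more often respondents claim it for themselves.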

The strength of the acceptability influence in self-description inven- 
tories depends on the degree to which the item alternatives differ in 
social acceptability and the punishments which result from not appear- 
ing socially acceptable. There is some variance in acceptability among 
the alternatives in attitude, opinion, and interest inventories, but not 
nearly so much as is involved in most self-description inventories. The 
punishment received for not appearing socially acceptable is either 
embarrassment or failure to obtain a sought-after job or position. 

The acceptability influence is less strong when respondents mark inventories
in private, when they do not put their names on the forms,
when assurance is given that all results will be kept anonymous, and
when the inventory is unconnected with selection procedures of any
kind. This is the situation that prevails in most research studies, and
it is in research studies that self-description inventories have their
most valid use.


Projective Techniques 

The projective techniques offer an approach to the measurement of
personality which is interestingly different from that of the self-description
inventories. Whereas the self-description inventories require
the subject to describe himself, the projective techniques require the
subject to describe or interpret objects other than himself. The projec- 
tive techniques are based on the hypothesis that an individual’s re- 
sponses to an “unstructured” stimulus are influenced by his needs, 
motives, fears, expectations, and concerns. 

If there is an agreed-on public meaning for a stimulus, it is referred 
to as a structured stimulus. If there is no agreed-on public meaning for 
a stimulus, and in consequence there is considerable latitude for in- 
dividual interpretation, it is referred to as an unstructured stimulus. 
A structured stimulus is compared with an unstructured stimulus in 
Figure 16-2. First, what do you see in picture a? Nearly everyone will
say that it is a house. A few people might call it a school or even a
jail, but people will generally agree that it is a dwelling of some kind.
The shape of a house is a highly structured stimulus. Now what do
you see in picture b? There is no accepted common meaning for that
stimulus pattern. It might be interpreted as a thunderstorm, a dog, or
an artist's palette.

Figure 16-2. Comparison of a relatively structured stimulus (a) with a relatively
unstructured stimulus (b).

There is a considerable body of evidence to show that interpreta- 
tions of relatively unstructured stimuli are related to moods, needs, 
and expectations. Studies of the effect of food deprivation show that
hungry subjects more frequently interpret ambiguous drawings as
representing food than do nonhungry subjects. In one study a completely
blank screen was used, and subjects were led to believe that
faint images were being presented. It was found that the number of
food responses increased as the interval of food deprivation lengthened. 
Other studies show that perception is influenced by values, social 
taboos, and personal conflicts. An experience common to most of us is
the misreading of printed material in a way that indicates our concerns
of the moment. For example, the student who worries over an
examination coming the next day is likely to see in a hasty glance at 
the evening paper that “The police will give the examination tomor- 
row,” whereas a careful reading will show that “The police will give 
the explanation tomorrow.”

If the individual interprets picture b in Figure 16-2 as a thunder- 
storm rather than as a dog, this may indicate something about him 
personally. However, it is one thing to say that a response is significant
and quite another thing to say just what it indicates about the
person.

The instruments which will be discussed in this section are primarily
based on the interpretation of personality characteristics from responses
to relatively unstructured stimuli. The ultimate in unstructured stimuli 
is found in one of the pictures in the Thematic Apperception Test 
(TAT). The subject is asked to make up a story about a completely 
blank card. Other stimuli are structured to some extent to obtain in- 
formation about particular needs and concerns. For example, in an in- 
strument used to study attitudes, a picture which shows a white per- 
son and a Negro talking can be used. The stimulus is structured to that 
extent and in that manner in order to learn about attitudes toward
Negroes.

In schools projective techniques should be employed only by clinical
psychologists and by school psychologists who have had intensive
training in the use of the methods. Because of the need for a highly
trained examiner, and because of the time required to administer and
interpret many of the instruments, projective techniques usually are
employed only with those students who show signs of having severe
emotional problems.

Rorschach Technique. By far the most widely used projective
technique is the Rorschach (4). It was developed during and after the
First World War by Hermann Rorschach, a Swiss psychiatrist. He experimented
with different ink blots to find a set which would provide
the most insight into the nature of mental disorder. The ten ink blots
which he settled on are still in use. An ink blot like those used in the
test is shown in Figure 16-3. Five of the ink blot cards are made in
shades of black and gray only. Two of the remaining cards contain
bright patches of red in addition to shades of gray. The three remaining
cards employ various colors.

The Rorschach, like the other projective devices, should be administered
only by a highly trained examiner. The results will depend
very much on the examiner's skill. The usual procedure followed in
administering the Rorschach is to talk with the respondent for a while
to gain rapport, seat him with his back to the examiner, and then introduce
the task with approximately the following instructions: “People
see all sorts of things in these ink blot pictures; now tell me what
you see, what it might be for you, what it makes you think of.” The 
respondent is allowed to give as many interpretations as he likes. If he 
gives only one response and apparently tries to give no more, the ex- 
aminer suggests that he look for other things, saying something like, 
“People usually see more than one thing.” A typical series of responses 
to one card would be for the subject to report, “It all looks like a bat,” 
and “This part looks like a vase.” After looking at the card for a few 
more seconds, he might give as his final response, “This little bit looks 
like a nose.” 


Figure 16-3. An ink blot of the type used in the Rorschach test. 


Before interpreting the Rorschach, the examiner applies numerous 
scoring systems to the responses. Responses are scored in terms of 
“content,” the types of objects which are seen. Some of the content
categories are human, animal, anatomical, sex, food, and clothing. Re- 
sponses are scored in terms of the extent to which they are ones that 
people in general tend to give rather than ones that are highly idio- 
syncratic. Responses are scored in terms of the extent to which various 
factors influenced the perceptions, such as the coloring of the blot, the 
outline of the form, and others. Also responses are scored in terms of 
the amount of the blot that is involved, e.g., whether the student 
weaves the whole blot into one interpretation or makes separate in- 
terpretations of different parts of the blot. 




Rorschach responses are interpreted in terms of psychoanalytic and 
other “depth” psychologies. The response summary alone is only a 
part of the material used in the interpretation. The trained examiner 
takes note of many complex relationships among content, location, and 
determinants. Thus, the movement response “children playing ball” 
might be interpreted quite differently from “men playing ball” in 
terms of the other responses in the record. A response that would be 
interpreted one way for a man is often interpreted differently for a 
woman. If a person who has never finished high school gives numerous 
anatomical content responses, it might be taken as an indication of 
morbid thoughts. If a college student gives many anatomical re- 
sponses, it might be interpreted as interest in and familiarity with 
biology. 

Some of the responses of a very depressed female student in the 
tenth grade will illustrate the features that go into an interpretation of 
Rorschach responses. In the classroom the girl shows obvious signs of
personal maladjustment. Although she is from a prosperous home, she
shows very little concern for neatness and often has dirt on her hands
and face. She is overweight, cries at the slightest provocation, and is
extremely dependent on anyone who will show her affection. The
Rorschach record shows signs of deep depression and provides some
clues about the basis of the problem. She gives only fourteen responses
to the ten cards, which is far fewer than normal. She responds slowly
and seeks approval from the examiner for her responses. Most of her
responses fall in the category of “poor perceptions.” She mostly gives
uncommon responses and ones that give little indication of creative
imagination. There are many indications of gloomy mood including
content concerning death, storms, and garbage. A clue to the depressive
reaction is found in the sexual content of the responses. In portions
of blots where such responses are very seldom given by people in
general she sees underwear, a girl in bed, and a man in a bathing suit.
On cards where portions of the blot quite frequently are perceived
either as sex objects or sex symbols, she is highly embarrassed and
apparently looks for some time to find “safe” parts of the blots to interpret.
The sexual aspects of the depression are brought out in subsequent
therapeutic sessions with a clinical psychologist. It is found that
because she is not pretty and feels unloved, she has engaged in numerous
acts of sexual promiscuity in the hope that such would gain her
the affection of boys. Coming from a highly religious home and having
rather strict parents, she becomes increasingly ashamed of herself and
fearful that her parents will learn the truth. The responses to the
Rorschach had shown the nature of the problem and had provided
clues that were useful in treatment.

The interpretation of Rorschach responses is an extremely complex
task. Although there are reasonably clear-cut rules for scoring individual
responses, there are only general standards for
the final interpretation. The interpretation depends on the
subjective impression of the examiner. It takes about two years of
practice, usually working in close collaboration with experienced
examiners, to become proficient at interpreting Rorschach responses.

The Thematic Apperception Test (TAT). The TAT, developed by
Murray (57) and his associates, consists of pictures of people in
various settings. One of the pictures is shown in Figure 16-4. Some of
the pictures are more suited to young rather than older people, and
some are more suited to males than females. The pictures are structured
in such a way as to elicit responses concerning relationships between
various social roles and responses relating to different emotions.
The pictures are unstructured in the sense that a wide variety of interpretations
can be given about the feelings and actions of the
persons shown.

Figure 16-4. One of the pictures used in the Thematic Apperception Test.
(Reproduced by permission of Harvard University Press.)

The subject is told to make up a story about each picture.
The instructions are approximately as follows: “I am going to show
you some pictures. I want you to tell me a story about what is going on
in each picture. What led up to it, and what will the outcome be?” 
The responses are either written down verbatim or a phonographic 
recording is made. 

No formal scoring system is used by the majority of TAT examiners. 
The examiner interprets the responses in terms of his knowledge of 
personality and his experience with the instrument. Some interpreta- 
tions are of a common-sense kind with which most examiners would 
agree. If a male student imputes unfriendly motives and actions to all 
the female characters, it strongly suggests that he is having troubled 
relations with women in real-life situations. If all the stories end in 
disappointment, embarrassment, and failure, it is likely that the stu- 
dent feels defeated and depressed. If the stories are lacking in pas- 
sion and violence, even in pictures where strong emotion is the evi- 
dent theme, it indicates that the subject is suppressing his own emo- 
tions. If any one type of social interaction, such as adultery, is seen in 
numerous pictures and in pictures where there is little to suggest it, 
this indicates that the subject is overly concerned about a particular 
issue. Although there is no standard procedure of interpretation, re- 
sponses to the TAT pictures provide many hints about the subject’s 
concerns, his conception of himself, and the way he views his human 
environment. 

The original set of pictures developed by Murray and his coworkers
is only one of many such sets of pictures that presently are in use. The
Murray pictures sometimes are used in diagnosing the problems of
high school students. More appropriate sets of pictures are available
for younger students (75) and even for small children (6). Special sets
of pictures have been composed for Negroes, Indians, and other racial
and ethnic groups. Pictures of the TAT kind have been used to study
anti-Semitism, family relations, attitudes toward military life, and
many others.

Other Projective Techniques. In addition to the Rorschach and the
TAT, there are numerous procedures for evaluating personality which
are best thought of as projective techniques. Almost anything that can
be described, completed, or interpreted serves to some extent as a
projective test. We tend to read our concerns and expectations into
everything we do.

An old technique for learning about personality is to have an individual
associate words. Various lists of words have been employed to
get at particular kinds of reactions. The usual practice is to place certain
emotionally tinged words among relatively neutral terms. The
following list of words would be useful in studying the home and
school adjustment of adolescents:


 1. Ha[?]     ____      11. Paper     ____
 2. Mother    ____      12. Shoe      ____
 3. Home      ____      13. Fight     ____
 4. Desk      ____      14. String    ____
 5. Book      ____      15. Sister    ____
 6. Father    ____      16. Cake      ____
 7. School    ____      17. Body      ____
 8. Love      ____      18. Me        ____
 9. Tree      ____      19. Brother   ____
10. Hate      ____      20. Friend    ____


The words are read to the subject one at a time. He is asked to give 
the first word that comes into mind. The examiner records the re- 
sponses and notes the time taken to respond. There are several ways 
in which the results are interpreted. The words which are associated 
often indicate the subject's attitudes toward persons and activities. In
obvious cases where, for example, the association to “father” is “spank- 
ing” and the association to “hate” is “father,” there is the strong sug- 
gestion of a negative reaction to the father. Equally revealing as the 
associated words are the emotional reactions to the initial terms. If 
the subject takes a relatively long time to respond or gives signs of 
being embarrassed, a strong emotional reaction is suggested. Thus, 
even though the subject eventually supplies an innocuous association 
for “father,” such as “hat,” the long time taken to respond would sug- 
gest “blocking” and underlying conflict. A third type of information 
used in the interpretation is the tendency to give unusual associations. 
For example, most persons will associate “table” with “chair” or re- 
spond with some other related item of furniture. It is unusual to find 
the response “tiger” to “chair,” and when numerous such associations 
are made, it might be related to mental illness.
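
Two of the cues described above, a long time taken to respond and an unusual
association, lend themselves to a simple tally. The sketch below is an
illustration only. The records, the table of ordinary associations (norms),
the function name flag_associations, and the rule that a response is "slow"
when it takes more than twice the median time are all assumptions made for
the example; they are not part of any published scoring procedure.

```python
from statistics import median


def flag_associations(records, norms, latency_factor=2.0):
    """Tag each stimulus word with the cues present in the response.

    records: list of (stimulus, response, seconds) tuples.
    norms: dict mapping a stimulus to a set of ordinary associations
           (hypothetical; a stimulus with no entry is never tagged "unusual").
    latency_factor: hypothetical cutoff; a response is "slow" when it takes
                    longer than latency_factor times the median time.
    """
    typical = median(seconds for _, _, seconds in records)
    flags = {}
    for stimulus, response, seconds in records:
        cues = []
        if seconds > latency_factor * typical:
            cues.append("slow")
        if stimulus in norms and response not in norms[stimulus]:
            cues.append("unusual")
        flags[stimulus] = cues
    return flags


# Hypothetical record for one subject.
records = [
    ("paper", "pencil", 1.2),
    ("father", "hat", 4.9),
    ("tree", "leaf", 1.0),
    ("chair", "tiger", 1.4),
    ("shoe", "foot", 1.1),
]
norms = {
    "paper": {"pencil", "pen", "write"},
    "tree": {"leaf", "green", "wood"},
    "chair": {"table", "sit", "desk"},
    "shoe": {"foot", "sock", "lace"},
}

flags = flag_associations(records, norms)
for stimulus, cues in flags.items():
    print(stimulus, cues)
```

A tally of this kind only points the examiner to responses worth a second
look. What a flagged response means about the person is still a matter of
interpretation.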

A technique which is similar to word association is the use of in- 
complete sentences. Examples are as follows: 


I dislike most to ________
I wish that I had never ________
Most people are ________
I become embarrassed ________
The people I like most ________


The usual procedure is to have subjects write their responses, per- 
mitting the testing of a number of persons at one time. The sentences 
can be structured to provide information on different areas of adjust- 
ment. No effort is made to time the responses. Consequently, the sub- 
ject has time to make up whatever responses he chooses. The responses 
are analyzed similarly to those on the TAT. That is, the moods, motives,
solutions, and expectations portrayed in the responses are interpreted
with respect to the subject's personality.

A method which is especially useful with children is to employ play 
materials as a projective device. Children betray their feelings quite 
readily in play activities of all kinds, and almost any set of play mate- 
rials serves as a “test.” Dolls and puppets are used most often for this 
purpose. The situation can be structured by, for example, naming the 
dolls “mother,” “father,” “little brother,” “me,” and so on. Also the 
environment for the dolls can be structured to some extent by having 
present toy implements, such as a baby bottle, a toilet, a bed, doll 
clothing, and others. The child is encouraged to play with the material
in whatever way he chooses. A revealing set of actions would be for
the child to place the dolls for “mother” and “little brother” with their
faces to the wall while “father” gives the bottle to “me.” Play activity
of this kind is usually very rich in suggestions about the feelings and
concerns of children.

Artistic productions can be used as projective devices. These may 
be almost completely unstructured like finger painting, or they may be 
structured to the point of asking for the drawing of a man. The ad- 
vantage of the relatively unstructured task is that it is often not per- 
ceived as a test. A widely used procedure is to furnish clay to children 
and let them make whatever they like. The product can be analyzed 
for the symbolism apparent or simply in terms of the actions imputed 
to the clay figures. If the child makes a clay image of himself, it is im- 
portant to note whether the bodily parts are in proper proportion. An 
outsized nose or excessively small arms might offer suggestions about 
the child's concept of himself. Children often manifest strong emotions
in artistic productions that they would not talk about openly.
For example, a disturbed child might make a clay figure of mother,
then run over her with the toy car, tear off the arms, and finally throw
her in the wastebasket.

Evaluation of Projective Techniques. Some adherents of projec- 
tive techniques feel that their instruments lay open the depths of per- 
sonality to observation. Rather sweeping generalizations often are 
made about the response to an unstructured stimulus. Before the reader
becomes alarmed about his own personality as it might be mirrored
in the projective techniques, a look should be taken at the factual basis
for the instruments. Some of the major points to consider are as
follows:

1. Reliability. The ordinary measures of reliability are difficult to
obtain with projective tests. If, as on the Rorschach, scores
are obtained prior to the interpretation, the reliability of these can be
studied by split-half, equivalent-form, and other techniques. The reliabilities
of single components on the Rorschach are less than those
obtained with most tests of human ability but not so low as to render 
the indices unusable. There are two reasons why the reliability of 
projective techniques cannot be determined from component scores. 
First, most of the projective devices employ few, if any, scores for 
separate responses. Second, even if separate responses are scored, the 
interpretation is the final test result, not the initial scores. Conse- 
quently, it is necessary to test the reliability of interpretations from 
the test. 
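
Where component scores do exist, a split-half study of the kind mentioned
above might be sketched as follows. The sketch assumes that each person's
record can be broken into separately scored parts, that the parts are divided
into odd and even halves, and that the half-test correlation is stepped up
with the Spearman-Brown formula. The data and the function names are
hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(part_scores):
    """Odd-even split-half reliability with the Spearman-Brown correction.

    part_scores: one list of part-level scores per person.
    """
    odd_totals = [sum(person[0::2]) for person in part_scores]
    even_totals = [sum(person[1::2]) for person in part_scores]
    r_half = pearson_r(odd_totals, even_totals)
    # Spearman-Brown: estimated reliability of the full-length measure.
    return 2 * r_half / (1 + r_half)


# Hypothetical component scores for five persons on six scored parts.
scores = [
    [3, 2, 4, 3, 2, 3],
    [1, 1, 2, 0, 1, 2],
    [4, 5, 3, 4, 5, 4],
    [2, 3, 2, 2, 1, 3],
    [5, 4, 5, 5, 4, 4],
]

print(round(split_half_reliability(scores), 2))
```

Such a coefficient describes only the component scores. It says nothing
about the reliability of the interpretation that an examiner builds on them.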

Whereas it is expected and usually necessary that scores on most 
tests remain stable over moderate periods of time, it is not necessarily 
expected with the results of projective techniques. If the scores made 
by students on an intelligence test fluctuate markedly over a period of 
six months or even several years, the use of the instrument would be 
seriously impaired. However, it is to be expected that some of the 
personality attributes mirrored in the projective techniques should 
sometimes change substantially in relatively short periods of time. For 
example, the Rorschach or TAT responses of an individual would 
likely change from the beginning to the end of a successful psycho- 
therapy. Instruments like the TAT are probably affected to some extent 
by day-to-day changes in moods and by good and bad turns of events. 
Consequently, it is difficult to untangle the expected changes in re- 
sponses from the measurement error inherent in the testing procedures. 

It is doubtful that any one test can make the sweeping observations
about personalities which are often claimed. Different techniques 
probably have different strong and weak points as personality meas- 
ures. Because an instrument is shown to be reliable, it does not neces- 
sarily mean that it is valid for any purpose; but if it can be shown that 
interpretations regarding certain kinds of traits are unreliable, it means 
that they cannot be valid in any sense. Systematic studies of the
reliability with which different projective techniques lead to interpreta- 
tions of various personality characteristics would help define tech- 
niques in terms of what they measure, at least reliably, and would 
narrow the field of investigation for subsequent studies of validity. 
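
One way such a study of the reliability of interpretations might be tallied
is to have two examiners interpret the same records independently and then
count how often they agree. The sketch below uses Cohen's kappa, which
corrects the raw agreement for the agreement expected by chance. The choice
of kappa is an assumption of this sketch and is not a method named in the
text, and the two sets of interpretations are invented.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty lists of equal length")
    # Proportion of cases on which the two raters give the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    # Agreement expected if each rater assigned categories independently
    # at his own observed rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)


# Hypothetical interpretations of eight records by two examiners:
# "d" = depressed mood indicated, "n" = not indicated.
examiner_1 = ["d", "d", "n", "n", "d", "n", "n", "d"]
examiner_2 = ["d", "n", "n", "n", "d", "n", "d", "d"]

print(round(cohen_kappa(examiner_1, examiner_2), 2))
```

A value near zero would mean that the examiners agree no more often than
chance, in which case interpretations of that trait could hardly be valid.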

2. Validity. Projective techniques must be classified as trait meas- 
ures by default. It is difficult to argue that the projective techniques 
are assessments of personality. For example, if an individual calls an 
ink blot a butterfly, there is no reason to believe that this response 
represents anything about his personality unless evidence is provided 
to prove that such is the case. Consequently, the validity of projective 
techniques can be determined only by correlating interpretations with 
important behaviors outside the testing situation. 

In comparison to the many applications of projective techniques, 
there are few studies of predictive validity. Consequently, no firm statements
can be made concerning the validity of projective techniques as
a group or about the validity of particular instruments. One of the 
problems is finding important variables for the projective testers to 
predict. The contrasted-groups study has been used most often to 
validate projective techniques. In a typical study an examiner is given 
the Rorschach records of both normal persons and individuals who are 
diagnosed as schizophrenic. The examiner must decide from the test
results alone whether each person is from the normal or the schizophrenic
group. Studies of this kind indicate that the Rorschach and the
TAT are moderately successful in differentiating normal persons from
the mentally ill. However, it should be pointed out that this is a very
weak test of validity. The instruments are often used to make fine
distinctions between people in the normal range and to differentiate
among types of mental illness.
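
The result of a contrasted-groups study can be summarized as the proportion
of records that the examiner assigns to the correct group, compared with the
proportion that would be correct if every record were assigned to the larger
group. The sketch below illustrates the arithmetic with invented judgments.
The function names and the data are hypothetical.

```python
def hit_rate(judgments, diagnoses):
    """Proportion of records the examiner assigns to the correct group."""
    if not diagnoses or len(judgments) != len(diagnoses):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for j, d in zip(judgments, diagnoses) if j == d)
    return correct / len(diagnoses)


def majority_rate(diagnoses):
    """Proportion correct from always guessing the larger group."""
    counts = {}
    for d in diagnoses:
        counts[d] = counts.get(d, 0) + 1
    return max(counts.values()) / len(diagnoses)


# Hypothetical study: five normal persons and five diagnosed schizophrenics.
diagnoses = ["normal"] * 5 + ["schizophrenic"] * 5
judgments = [
    "normal", "normal", "normal", "normal", "schizophrenic",
    "schizophrenic", "schizophrenic", "schizophrenic", "normal", "schizophrenic",
]

print(hit_rate(judgments, diagnoses), majority_rate(diagnoses))
```

Even a hit rate well above the majority rate shows only that the two extreme
groups can be told apart. It does not show that the finer distinctions often
drawn from the instruments are justified.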

One of the difficulties in validating and improving the projective 
techniques is an unwillingness on the part of some devotees to subject 
their instruments to empirical investigation. A kind of cultism which 
encourages faith in the instruments rather than a healthy scientific 
skepticism has arisen among some projective testers. They would like
other people to accept their projective devices and the elaborate in- 
terpretations which they make as self-evidently valid. It is encourag- 
ing to see that many exponents of projective techniques are aware of 
the need for empirical investigations and are busily performing the 
necessary research. 

3. Standardization. A test was defined earlier as a standardized
situation which provides the individual with a score. Do the projective
techniques meet the requirements? In comparison to most tests, the
projective techniques are relatively unstandardized. Although efforts
are made to standardize the presentation of material to the subject,
there are inevitable differences in the approaches used by different
examiners. Much apparently depends on the way the examiner acts
and the kind of person he is. With an instrument like the Rorschach,
some examiners typically obtain more responses than other examiners,
and women examiners sometimes obtain responses different from those
obtained by male examiners.

The final results of projective techniques, the descriptions of indi- 
vidual personalities, are highly dependent on the intuitive judgment of 
the examiner. Not only are examiners unable to catalogue all the rules 
which they use in reaching interpretations, but they are probably not 
aware of many cues which they employ. The examiner is not a person 
who simply administers the test and as such plays a minor role in the 
result. The examiner is part of the projective technique and inseparable
from the test materials. Some examiners are undoubtedly more effective
than others in deriving personality descriptions. Consequently, the
validity of the technique is interwoven with the ability of the examiner
who uses it. Some efforts have been made to more fully standardize
projective techniques, particularly the Rorschach (42).

4. Special advantages. One of the foremost advantages of projec- 
tive techniques is that most of them are difficult to fake. The subject 
is usually unaware of how his responses will be interpreted. The person 
who tries to distort responses often gives himself away. It was men- 
tioned earlier in connection with the word-association technique that 
an effort to cover up unpleasant feelings can usually be detected by 
the relatively long time taken to respond. There are many other ways 
in which the experienced examiner can detect what the subject tries 
to hide. Even professional testers find it difficult to distort their own 
responses to projective techniques. 

An advantage which is shared by most of the projective techniques 
is that they can be administered to persons of all ages, ethnic groups, 
and intelligence levels. The instructions are very simple and, in most 
cases, neither reading nor writing is required. The projective tech- 
niques are particularly applicable to children. Children who are un- 
able or unwilling to discuss their problems directly usually react to the 
projective techniques as though they were games.

5. Summary evaluation. The projective techniques are ingenious 
efforts to measure personality variables. Many of the interpretations 
that arise from them appeal to common sense and fit in with psychological
theory also. Unfortunately, the techniques are relatively unstandardized,
and it is difficult to determine how well they work. Some
projective testers have made unwisely sweeping claims for particular 
instruments. The techniques are often said to measure the “whole 
personality” and the “total behavior pattern.” It is doubtful that activi- 
ties so circumscribed as responding to ten ink blots or making up 
stories about particular pictures will lead to such broad conclusions 
about the complexities of human personalities. The indicated directions 
for future research are to standardize the projective techniques and to 
determine the kinds of personality attributes which each measures 
most effectively.


Observation of Behavior 


Rather than infer personality characteristics from paper-and-pencil 
or projective tests, another approach is to observe people as they 
actually behave. As a simple example, a disturbed child can be 
observed as he plays with a group of children. If the observations can 
be reliably recorded and scored, they can be used as personality 
measures. The advantage of observational testing is that it has a real-life
quality not shared by conventional testing instruments.


Many everyday decisions about people are, of course, reached by a 
form of observational analysis. The football coach observes the fresh- 
man quarterback to see how well he passes, kicks, and runs with the 
ball. The new bank teller is judged in terms of his promptness, ac- 
curacy in maintaining accounts, and courteousness to customers. The 
new cook is judged by her cakes and pies and by the neatness of the 
kitchen. The following sections will discuss some of the ways in which 
behavioral observation can be used to measure personality traits. 

Ratings. Ratings are used very widely as a means to record be- 
havioral observation. This is frequently done in the elementary grades 
as an adjunct to the report card. Figure 16-5 shows a typical set of 
scales used on report cards. Figure 16-6 shows a set of rating scales
that could be used with high school and college students. 


                                                Shows
                                                satisfactory   Needs
                                                growth         improvement

Takes pride in his work and completes it        ______         ______
Responds courteously and cheerfully to
  school regulations                            ______         ______
Works well by himself                           ______         ______
Shows self-control                              ______         ______
Respects the rights and property of others      ______         ______
Pays courteous attention while others are
  speaking                                      ______         ______
Works well with others                          ______         ______
Takes pride in [illegible] accomplishments and
  school activities                             ______         ______
Makes good use of his time                      ______         ______


Figure 16-5. Typical rating scale used on report cards.


Ratings are often said to be more objective than self-description
inventories. It is certainly true that other people are less sensitive in
recording a student’s shortcomings than he himself would be. However,
there are numerous pitfalls that beset ratings, and only after these have
been guarded against will ratings provide a valid picture of personality.

One of the most common faults of ratings is due to lack of information
about students being rated. In the elementary grades, teachers
usually have a great deal of first-hand experience with their students,
at least in that one setting. In high school, teachers may see students
only during a small portion of the day, and, consequently, they are not
sufficiently familiar with students to make valid ratings.

Even if teachers are in close proximity to students over a long period
of time, they still may not have sufficient evidence for making valid
ratings. Typically, teachers are most familiar with the extremes. The
very “good” and the very “poor,” the very “healthy” and the very 
“sick” stand out; and teachers usually can make valid ratings about 
their personalities. The majority of students, however, do nothing 
either so good or so bad as to make their personalities clear, and, conse- 
quently, relatively unreliable ratings are made of them. 


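                      Much below   Below                Above     Much above
                      average      average   Average    average   average

Courtesy              ____         ____      ____       ____      ____
Intelligence          ____         ____      ____       ____      ____
Moral character       ____         ____      ____       ____      ____
Personal appearance   ____         ____      ____       ____      ____
Health                ____         ____      ____       ____      ____
Ambitiousness         ____         ____      ____       ____      ____
Friendliness          ____         ____      ____       ____      ____
Creative ability      ____         ____      ____       ____      ____
General knowledge     ____         ____      ____       ____      ____
Writing skill         ____         ____      ____       ____      ____
Emotional stability   ____         ____      ____       ____      ____
Diligence             ____         ____      ____       ____      ____


Figure 16-6. A typical set of rating scales.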


Even when teachers have extensive experience with students in their
particular classroom setting, they may know little about how those
students behave in other classes, at play, and at home. The first step in
obtaining valid ratings is to ensure both that teachers have an extensive
acquaintance with students and that they have witnessed behavior in
situations relevant to the traits being rated.

In addition to lack of information about students, ratings typically 
suffer from a number of other faults. One of these is personal bias 
toward students. It is difficult not to give better ratings to students 
whom we personally like than to students whom we do not know very 
well or whom we do not like. Another source of error is to rate all 
students generally high or generally low. Some teachers have a posi- 
tive bias, tending to rate all students above average. Other teachers 
have a negative bias, tending to rate all students below average. Then, 
obviously, whether a student is rated as having a good rather than a 
bad personality depends on the happenstance of which teacher makes 
the ratings. 

A form of bias called the “halo error” consists of giving all bad, all 
average, or all good ratings to students. Rather than think differentially 
about the strong and weak points in students, it is tempting to think
of them as being all bad or all good. Consequently, teachers are prone
to rate a student in much the same way on different rating-scale items 
even when that is not the true picture. For example, even if a student 
works well with others, and follows instructions, it does not necessarily 
mean that he is happy or has good health habits. 

A number of things can be done to improve ratings. One, which was 
mentioned previously, is to provide more and better opportunities to 
observe students. In this connection, it would be better to have ratings 
made by only those teachers who see the students during a considera- 
ble portion of the school day. Also, ratings are usually more valid if 
they are made near the end of the term rather than in the early “get- 
acquainted” weeks of the term. A second way to improve ratings is to 
train teachers for the task. They should be told about the various 
types of errors that occur and given extensive practice in making 
ratings. Third, when possible to do so, substantially more valid ratings 
are obtained by averaging those given by two or more teachers. This 
tends to iron out the biases of individual teachers and greatly reduces
the chance element.
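
The third suggestion, averaging the ratings given by two or more teachers, can be illustrated with a minimal sketch. The teacher names, student names, trait, and 1-to-5 scale below are hypothetical and serve only to show how a lenient rater and a severe rater tend to offset one another.

```python
from statistics import mean

def average_ratings(ratings_by_teacher):
    """Average each student's rating over all teachers who rated that student."""
    collected = {}
    for teacher_ratings in ratings_by_teacher.values():
        for student, rating in teacher_ratings.items():
            collected.setdefault(student, []).append(rating)
    return {student: mean(values) for student, values in collected.items()}

# Hypothetical ratings of "works well with others" on a 1-5 scale.
ratings = {
    "Teacher A": {"Ann": 5, "Bob": 4, "Cal": 4},  # tends to rate high
    "Teacher B": {"Ann": 3, "Bob": 2, "Cal": 1},  # tends to rate low
    "Teacher C": {"Ann": 4, "Bob": 3, "Cal": 3},
}
averaged = average_ratings(ratings)
for student, value in averaged.items():
    print(f"{student}: {value:.2f}")
```

The average is no better than the ratings that go into it; it is useful only when each teacher has had adequate opportunity to observe the students.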


Peer Ratings. In addition to having ratings made by teachers, for
some purposes it is useful to have students rate one another. Such
ratings are particularly helpful to the teacher in understanding the 
social problems of students. One method of obtaining peer ratings is 
called the “guess who” technique. Some sample items are as follows: 


                                                        Name

Guess who is the best liked boy in class.               ________
Guess who is the best baseball player in class.         ________
Guess who follows directions best.                      ________
Guess who starts the most arguments.                    ________
Guess who is the most generous boy.                     ________
Guess who is the most selfish boy.                      ________


A simple way to analyze the results of “guess who” items is to count
the number of times each student’s name is placed in each blank. This
might show that Fred Cincwich is nominated by over half the class as
“best liked boy in class” and that Maurey Lawson is nominated by over
half the class as “starts the most arguments.” Such findings would 
provide many clues as to how students react to one another and 
provide the teacher with valuable information for helping individual 
students. 
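
The counting procedure just described can be sketched in a few lines. The ballots below are invented for illustration, and the item labels are shortened; a blank is treated as "no nomination."

```python
from collections import Counter

def tally_nominations(ballots):
    """Count, for each item, how many times each student is named."""
    tallies = {}
    for ballot in ballots:
        for item, name in ballot.items():
            if name:  # skip blanks that were left unanswered
                tallies.setdefault(item, Counter())[name] += 1
    return tallies

# Hypothetical ballots from four students.
ballots = [
    {"best liked": "Fred", "starts arguments": "Maurey"},
    {"best liked": "Fred", "starts arguments": "Maurey"},
    {"best liked": "Tom", "starts arguments": "Maurey"},
    {"best liked": "Fred", "starts arguments": ""},
]
tallies = tally_nominations(ballots)
top_name, top_count = tallies["best liked"].most_common(1)[0]
print(top_name, top_count)
```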

Another type of peer rating is obtained by having students select 
those students that they would most like as friends or would most like 
as partners in particular activities. The results from a set of choices 
can be plotted as a diagram, or sociogram, as it is called, showing the
pattern of choices. A sample sociogram is shown in Figure 16-7. The
sociogram provides a handy picture of the pattern of choices. The one 
in Figure 16-7 shows, for example, that student 3 is an isolate, being 
chosen by none of the students. Students 5, 8, 11, and 12 form a close- 
knit clique. Student 12 is especially popular in that group, receiving 
the first-choice nominations of the other three members. Many other 
interesting relations can be seen in the pattern of choices. Relations of 
the kind shown in the sociogram would be useful in dividing the class 
into work groups and would suggest ways in which relations between 
particular students might be improved. 


[Sociogram not reproduced. Legend: solid arrow = first choice; broken arrow = second choice.]


Figure 16-7. Sociogram showing the choices for work partners in a group of
twelve students.
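
The information carried by a sociogram can also be tabulated directly from the choices. The following is a minimal sketch under stated assumptions: the six students and their choices are hypothetical (they are not the data of Figure 16-7), and each student is assumed to name a first and a second choice.

```python
def choices_received(choices, students):
    """Number of times each student is chosen by others."""
    received = {student: 0 for student in students}
    for picks in choices.values():
        for picked in picks:
            received[picked] += 1
    return received

def find_isolates(choices, students):
    """Students who are chosen by no one."""
    received = choices_received(choices, students)
    return [student for student in students if received[student] == 0]

# Hypothetical data: student -> (first choice, second choice).
students = [1, 2, 3, 4, 5, 6]
choices = {
    1: (2, 4),
    2: (1, 4),
    3: (4, 1),
    4: (2, 1),
    5: (4, 6),
    6: (5, 4),
}
received = choices_received(choices, students)
isolates = find_isolates(choices, students)
print(received)
print(isolates)
```

In this invented class, student 3 is an isolate and student 4 receives the most choices, the same kinds of relations that are read off the diagram by eye.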


There are two major principles to follow in obtaining peer ratings. 
First, the ratings must be simple and directly understandable to the 
students. It would, for example, be quite inappropriate to seek ratings 
for such abstract traits as extraversiveness or neurotic tendency. The 
language should be kept simple, and the ratings should pertain directly 
to the student's world. The second principle is that ratings should be 
entirely anonymous. Students should be told “No one else in the class 
will see your ratings.” Anonymity is necessary both to protect the feel- 
ings of students who receive “bad” ratings and to elicit honest responses 
from all students. 

Behavioral Tests. Another method of behavioral observation is to 
collect objective products of activity in lifelike situations. The test
concerns what an individual actually does rather than ratings of his
behavior. One of the earliest and still the best-known uses of behavioral
tests was that of Hartshorne and May (40) of the Character Education 
Inquiry. They wanted to measure traits in school children, such as 
honesty, truthfulness, cooperativeness, and self-control. Rather than 
use conventional tests or ratings to measure these characteristics, they 
chose to observe the actual behavior of children with respect to the 
traits. The observations were conducted in the normal routine of school 
activities, in athletics, recreation, and classroom work. 

Observations were made with respect to each trait in such a way as 
to provide an objective score. For example, one of the measures of 
cheating was obtained by allowing students to grade their own papers 
and noting the number of alterations of answers. Tests like vocabulary 
and arithmetic reasoning were administered in the classroom. The tests 
were collected and a duplicate copy made of each. The original un- 
scored papers were returned to students along with a list of the correct 
answers. Children scored their own papers and either gave themselves
the correct grade or altered answers to improve their standings. Scores
reported by students were then compared with the scores students
actually attained, as measured by the duplicate test copies. The amount
of discrepancy between the two scores provided a measure of cheating.

Another of the Hartshorne and May tests concerned the trait of
charity. Each child was given an attractive kit, including ten articles
such as pencil sharpener, eraser, and ruler. After the children had
examined the materials for some time, they were allowed to give away
some or all of the items to “less fortunate children.” The children were
not coerced to donate, and the donations were made anonymously.
Each child was provided with a large envelope in which to put his
donation, and the donations were dropped into a common box. Unknown
to the children, envelopes had been marked in such a way as to
identify the donations of each child. The number of articles donated
served as a measure of charity.

When behavioral tests of the kind originated by Hartshorne and
May can be used, they have a number of attractive advantages. The
use of actual behavioral products frees the measurement procedure
from the subjectivity of rating scales. If observations can be made in
natural situations where the subject is unaware that he is being tested
in any sense, the results are probably more valid.

Other than the Hartshorne and May studies, there have been few
attempts to develop systematic behavioral tests, although much the
same thing is done informally in many evaluational efforts. The major
uses of behavioral tests have been with children. This is because children
are easier to find in group situations or easier to place in group
situations without having them suspect an ulterior purpose. Also, the
behavioral products of children are usually simpler and more easily
measured than complex adult interactions. Behavioral tests are some- 
times used with kindergarten and nursery school groups. Behavioral 
records can be kept either surreptitiously by an examiner, or the chil- 
dren can be watched through a one-way vision screen. Notes are made 
of the number of times a child offers a toy to others, the number of 
times he asks for adult help, and other relevant behavior. In order to 
get at more complex responses, it is sometimes necessary to use both 
direct recording of actions and ratings of behavior to measure such 
traits as responsiveness and tendency to withdraw. 
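
The self-scoring measure of cheating described earlier in this section reduces to a difference between two scores: the score a child reports and the score earned on the duplicate copy of his paper. A minimal sketch follows; the answer key, the answers, and the reported score are hypothetical.

```python
def score(answers, key):
    """Number of answers that match the key."""
    return sum(1 for given, correct in zip(answers, key) if given == correct)

def cheating_discrepancy(reported_score, duplicate_answers, key):
    """Reported score minus the score actually attained on the duplicate copy."""
    return reported_score - score(duplicate_answers, key)

# Hypothetical five-item test.
key = list("BADCA")
duplicate_answers = list("BACCB")  # what the child originally wrote
reported_score = 5                 # what the child claimed after self-scoring

attained = score(duplicate_answers, key)
discrepancy = cheating_discrepancy(reported_score, duplicate_answers, key)
print(attained, discrepancy)
```

A discrepancy of zero indicates honest scoring; larger positive values indicate more answers altered in the child's favor.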


Promising New Approaches 


Admittedly the methods of personality measurement described so 
far in this chapter are not nearly as good as they should be. A great 
deal of research presently is being done to derive better methods. 
Although it is too early to judge which of the new methods will prove 
most fruitful, the range of new methods being developed, and the 
findings which are being reported, are encouraging. Some of the most 
encouraging new approaches are described in the following sections. 

Physiological Measures. It is a truism to say that the way a person 
is physically constituted has a strong influence on his personality. 
Although much of what we call personality is the result of long and 
complex learning experiences, much also is apparently determined by 
the particular muscular, neural, and chemical makeup of the individual. 
Insofar as relevant physical and physiological traits can be measured 
and validated, they can be used as tests. 

There is a wealth of evidence to demonstrate connections between 
physical states and personality. When people are sick, they seldom 
behave the same as when they were well. They often are more easily 
disturbed, quick-tempered, and depressed. Ordinary medicines which 
are taken for colds and other disorders affect our moods and social 
behavior. Different drugs tend to have different effects. One drug tends 
to make people elated and another tends to depress. The impact of 
neural functions on personality is often witnessed in the altered 
behavior of people with brain damage. Even the relatively mild 
physiological impact of the changing weather seems to alter our moods 
and social behavior. 

In recent years numerous studies have been made of individual 
differences in nervous-system functioning, particularly in relation to 
the concept of “arousal.” The nervous systems of some people typically 
are “aroused”—ready for action. The nervous systems of other people 
typically are less aroused, or less excited. Arousal is measured by
respiration rate, heart rate, electrical resistance of the skin, salivation
rate, muscular tonus, and other indices. Differences in arousal might prove
to be important dimensions of personality. Numerous findings suggest
that levels of arousal relate to memory, problem-solving ability, and 
social traits. 

The blood stream carries not only the nutrients and regulators of 
bodily activity but substances which influence our moods and social 
behavior. Insofar as chemical differences among people can be 
shown to relate to personality differences, a blood test might prove to 
be a perfectly sensible approach to predicting certain kinds of social 
behavior. Most of our knowledge about the relationship between social 
behavior and blood chemistry comes from introducing chemicals into 
the blood and noting their effect. Experimental studies of the effect of 
drugs and glandular injections on social behavior exemplify this 
approach. However, it is more interesting, from the standpoint of
psychological measurement, to search for existing chemical differences 
among people and to relate these to social behavior. 

The use of chemical indices to measure certain types of personality 
characteristics has been encouraged by findings that suggest that dif- 
ferences in blood chemistry exist between the mentally ill and normals. 
Although it may be some time before such chemical differences are
well known, common sense suggests that differences exist. It may seem 
somewhat fantastic to consider the possibility at this time, but the 
years ahead well may see the development of blood tests to help in the 
early detection of the tendency toward certain types of mental illness. 

Response Sets. It is possible to differentiate people not only in 
terms of the test scores which they make but in terms of the ways in 
which they take tests as well. Regardless of what the test is about, 
people bring with them certain test-taking habits, or “response sets,” 
which, in part, determine their scores. One of the oldest observations 
along these lines is that personality, or as it was called in older days, 
“temperament,” is involved in psychophysical studies of judgment. The 
subject can be asked either to judge whether stimulus A is greater or 
less than B or he can be provided with a third choice, an “indifferent”
category, in which the judgment is “neither less nor greater.” It was
noted early that some subjects use the “indifferent” category much
more than other subjects do, regardless of the judgments being made, 
indicating the presence of a personality trait seemingly involving the 
“willingness to take a stand.” 

One type of response set which has received some attention lately is 
that of “acquiescence.” Acquiescence is usually studied with either 
very ambiguous or very difficult statements like the following: 


The moons of Saturn were first discovered
by Gustav Whittenborn.                        Yes ___   No ___

An excess of porthymadrone results in
dilation of the pupils.                       Yes ___   No ___


Very few persons are likely to have any first-hand information about 
statements like those above. Although both of them are false, they
might as well be true as far as most persons are concerned. When 
presented with a longer list of such questions, some persons charac- 
teristically agree (answer “yes”) and others show a tendency to dis- 
agree. People who give a preponderance of “yes” answers are said to 
acquiesce. That is, they seem to be pushed into agreement by the force 
of the statements alone. 
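
One simple way to express this tendency as a number is the proportion of “yes” answers given to such items. The sketch below is only illustrative; the scoring rule and the sample answers are assumptions, not a published procedure.

```python
def acquiescence_index(responses):
    """Proportion of 'yes' answers among the items actually answered."""
    answered = [r for r in responses if r in ("yes", "no")]
    if not answered:
        raise ValueError("no scorable responses")
    return sum(1 for r in answered if r == "yes") / len(answered)

# Hypothetical answers to eight ambiguous statements; one was left blank.
responses = ["yes", "yes", "no", "yes", "", "yes", "yes", "no"]
index = acquiescence_index(responses)
print(round(index, 2))
```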

There are numerous other response sets which can be measured. 
When a rating scale is used instead of a simple “yes” or “no” answer, 
people tend to pile up their responses at different points on the con- 
tinuum. If, for example, opinions are being studied with a seven-point 
continuum which ranges from “strongly agree” to “strongly disagree,”
some people will most often mark the extremes and some people will 
put most of their marks in the middle of the continuum. 
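
The tendency to mark the extremes or the middle can be summarized as two proportions. In this sketch the scale points are numbered 1 to 7, the “middle” is taken to be the single midpoint (an assumption; a wider middle band could equally be used), and the marks are invented.

```python
def response_style(marks, points=7):
    """Proportion of marks at the two extremes and at the midpoint of the scale."""
    if not marks:
        raise ValueError("no marks to score")
    midpoint = (points + 1) // 2  # assumes an odd number of scale points
    extreme = sum(1 for m in marks if m in (1, points))
    middle = sum(1 for m in marks if m == midpoint)
    return {"extreme": extreme / len(marks), "middle": middle / len(marks)}

# Hypothetical marks by one person on eight opinion items.
marks = [1, 7, 7, 4, 1, 2, 7, 1]
style = response_style(marks)
print(style)
```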

Another interesting response set is the tendency to deviate from the 
“usual” response. This can be studied with questions relating to self- 
conception, interests, attitudes, and even esthetic preferences. In such 
studies it is found that some people give responses which are markedly 
different from what the “average” person gives, e.g., express prefer- 
ences for geometrical designs which most people would consider un- 
pleasant. If a person deviates from the average in this way on many 
types of materials, it might indicate that he also is deviant in his social 
behavior. Some research findings suggest that deviant responses relate 
to certain forms of mental illness.

Verbal Behavior. Another approach to the study of personality is 
through the typical words that a person uses. It is well known that 
word usage is related to intelligence and amount of education, and it 
may also be the case that language habits relate to personality charac- 
teristics. The author (59) and his colleagues have been investigating 
some dimensions of language habits. One of these is the tendency in 
conversation to use “pleasant” words like good, pretty, kind, and sweet. 
Another is the tendency to use “unpleasant” words like bad, ugly, 
mean, and bitter. A third is the tendency to use words relating to the 
detailed characteristics of things, such as green, heavy, sharp, and 
furry. A fourth is the tendency to categorize things, using words like 
plant, group, institution, and tool. These are only some of the interesting
ways to classify the different modes of verbal response to objects,
people, and ideas; and each of these modes of response is potentially 
useful as a measure of personality traits. The findings with respect to 
one of these types of verbal habits will illustrate the possibilities. 
People who tend to use many “unpleasant” words also tend to be 
unpleasant people. They do not think highly of themselves, and they 
tend to be introversive and neurotic. As is true of the other new
approaches to personality measurement described in this chapter, not
enough research has been done to speak with firmness, but the study 
of individual differences in verbal behavior offers another promising 
approach to the measurement of personality characteristics. 

Judgment and Perception. Probably the most important discoveries 
in relation to the measurement of personality are those that show that 
some of the so-called measures of ability actually relate to personality. 
Some measures of perception apparently relate to personality charac- 
teristics as well as to abilities. Tests of embedded figures and hidden 
figures, of the kinds described in Chapter 11, apparently correlate with 
measures of “independence,” in the broad sense of the word. People 
who perform well on such tests tend to “think for themselves” without 
being overly swayed by the opinions of others. 

Another perceptual measure which apparently relates to “independ- 
ence” is that of orientation toward the vertical. The measure is ob- 
tained in a completely darkened room. The subject sees a luminous 
rod in the center of a luminous square frame. The frame is alternately 
tilted to the left and right. By a system of controls, the subject is asked 
to fix the rod so that it is straight up and down. Some persons are 
markedly affected by the distracting frame, and, consequently, when 
they set the rod as vertical, it is actually much to one side. Other sub- 
jects are not fooled by the distracting frame, and, consequently, they 
are able to correctly adjust the rod to a vertical position. The ability 
to accurately perform this perceptual task apparently relates to social 
independence in much the same way as do embedded-figures tests. 
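
A score for the rod-and-frame task can be obtained from how far each setting of the rod departs from true vertical. The sketch below assumes settings are recorded in degrees (0 being vertical, sign indicating direction) and uses the mean absolute deviation; both the scoring rule and the data are illustrative assumptions, not the procedure of any particular study.

```python
def mean_absolute_tilt(settings_in_degrees):
    """Average departure of the rod from true vertical, ignoring direction."""
    if not settings_in_degrees:
        raise ValueError("no settings recorded")
    return sum(abs(s) for s in settings_in_degrees) / len(settings_in_degrees)

# Hypothetical settings over four trials (frame tilted alternately left and right).
swayed_subject = [8, -10, 12, -6]   # pulled toward the tilted frame
steady_subject = [1, -2, 0, 1]      # largely unaffected by the frame

swayed_score = mean_absolute_tilt(swayed_subject)
steady_score = mean_absolute_tilt(steady_subject)
print(swayed_score, steady_score)
```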

The influence of suggestibility on perceptions and judgments pro- 
vides another possible way to learn about personality. There are a 
number of ways in which such measures can be made. One is to have 
subjects stand with their eyes closed while the tester suggests that 
they are falling. The tester says “You are falling, you are falling, etc.” 
Some subjects do not budge, some sway markedly, and others have to 
be prevented from falling to the floor. The amount of sway in such 
tests apparently relates to social suggestibility and to measures of 
neuroticism. Another measure of suggestibility concerns the perception
of apparent movement. In a darkened room, the subject sees a circle 
of light on a wall. He is told to report when the light moves, in which 
direction it moves, and how far it moves. Actually, the light does not 
move at all, but to many persons it looks as though it does. The amount
of apparent movement offers another possible measure of suggestibility.

Another perceptual characteristic which has been related to personality
is “dark vision.” Dark vision is the ability to see well at night
or in darkened rooms, in contrast to the usual standard of good vision,
the ability to see well in daylight or in lighted rooms. Odd as it may
seem, there is some evidence to indicate that people with poor dark
vision tend to be neurotic. (The reader should keep in mind that, as is
true of all of the statistical tendencies cited in this book, statistical 
trends must allow for many exceptions. Even if it is true that there is 
a correlation between dark vision and neurotic tendency, many persons 
will have poor dark vision because of physical abnormalities, not 
because they are neurotic.)

Numerous other examples could be given of tests that ostensibly 
measure judgments and perceptions which also apparently relate to 
personality characteristics. Such tests have obvious advantages over 
self-report, projective techniques, and observation. To the extent to 
which measures of perception and judgment actually relate to per- 
sonality characteristics, they offer the hope of a truly objective meas- 
urement of personality in the years ahead. 


Summary 


There is no doubt that personality traits are very important things 
to consider in making decisions about everyday classroom activity as 
well as in many other types of educational decisions. In spite of the 
urgent need, at the present time only approximate methods of per- 
sonality measurement are available. The three approaches most fre- 
quently used are self-report inventories, projective techniques, and 
observation. 

On the face of it, one would think that self-report should provide 
valuable measures of personality. The individual “accompanies him- 
self” wherever he goes and is in a position to observe himself in many 
types of social interactions. However, people sometimes are blissfully 
unaware of their own personality traits. Also, when people are asked 
to describe themselves on personality inventories, they tend to distort 
their responses in such a way as to place themselves in the best light. 
In addition, personality inventories are beset by language problems, 
and their results are influenced by numerous types of response sets. 

The advantage of projective techniques (at least potentially) is that
they can go beyond what the individual knows about himself and is 
willing to report. The difficulties with projective techniques are (a) 
they require an expert examiner, (b) some examiners are much better 
than others at interpreting the results, and (c) they are prohibitively 
time consuming and expensive for routine testing in schools. Presently 
some efforts are being made to develop better-standardized and more 
easily employed projective tests, and eventually these may add ma- 
terially to available methods. 

The use of observation to study personality has a real-life quality 
not shared by self-report inventories and projective techniques. Unfortunately,
except for rating scales, observational methods are very
time consuming and difficult to employ. Rating scales are used frequently
to measure personality characteristics, and if certain cautions
are heeded, they usually possess at least modest validity. If teachers 
make ratings of students, they must beware of several types of artifacts 
that tend to lower the validity. Also, if any faith is to be placed in the 
results, it must be ensured that teachers actually have sufficient ac- 
quaintance with students in situations relating to the particular traits 
being rated. A valuable addition to having teachers rate students is to 
have students rate one another. In many ways students are far better 
acquainted with one another, particularly with their own feelings 
about one another, than the teacher ever could be. 

Although we have not yet reached a high level of validity in the 
measurement of personality characteristics, current research gives hope 
that better methods will be available in the future. Although the 
evidence is still only suggestive, apparently some types of personality 
characteristics are involved in tests that, on the surface, appear not to 
concern personality. Among these are measures of blood chemistry, 
neural activity, response sets, verbal behavior, perception, and others. 
It is hoped that in the years ahead we will be able to develop personality
measures that have the same validity, and can be used as routinely,
as measures of aptitude and achievement.


Suggested Additional Readings 


Anastasi, Anne. Psychological testing. (2nd ed.) New York: Macmillan, 1961,
chaps. 18-21.
Cronbach, L. J. Essentials of psychological testing. (2nd ed.) New York: Harper
& Row, 1960, chaps. 15-19.
Guilford, J. P. Personality. New York: McGraw-Hill, 1959.
Thorndike, R. L. and Hagen, Elizabeth. Measurement and evaluation in psychology
and education. (2nd ed.) New York: Wiley, 1961, chaps. 12-15.


part V 1 


Development of 


Testing Programs 


It is only natural to expect that by the time the reader 
has reached this point in Educational Measurement 
and Evaluation, he is becoming somewhat overburdened
with facts and principles concerning the
use of tests. Also, he is probably growing a little
weary of the topic and is happy to see that only one 
chapter remains. What remains to be done is to try to 
tie together some of the most important principles in 
the book and relate those to the practical problems of 
obtaining and using tests. An effort will be made to 
do this in the final chapter. 


chapter 17


Development of 


Testing Programs 


Even though by now the reader has, we hope, learned a great deal 
about the facts and principles of educational measurement, there still 
are some important things which he may not know. What tests should 
be used in a comprehensive testing program in elementary school? 
When should tests be given? Where do you obtain information about
particular tests? Who makes decisions about which tests to use, and
who administers the tests? How do you obtain the cooperation of
teachers and parents in the proper use of tests? These are some of the
questions with which this chapter is concerned.


School-wide Programs 


It is important to distinguish between the tests that teachers either 
construct or purchase for use with their own classes and those that 
are regularly administered throughout the school. For testing their 
own students, teachers usually have considerable autonomy in selecting 
or constructing tests and in deciding how they will be used. Also,
teachers usually feel relatively free to alter their methods of test con- 
struction and to change their schedule of testing. In contrast, tests 
used in a school-wide program are usually selected by specialists in 
the particular school system, with the help and advice of teachers and 
school administrators. Once tests are selected, they usually are em- 
ployed at regular intervals throughout the school or school system. 
Because it disrupts the testing program to change tests from year to 
year, or to change the times of testing, school-wide programs should 
be carefully planned in advance so that such changes are minimal. 

Looked at purely from the standpoint of measurement theory, there
is no end to the number and kinds of tests that should be used. Poten-
tially, any test administered at any point in time can provide valuable
information to help in making educational decisions. There are, how-
ever, some practical considerations that militate against the use of
a great many tests. Obviously, it takes time to find, order, use, score, and
interpret tests. Although one test for one student, e.g., a measure of 
intelligence, may cost as little as 25 cents, many tests given to many 
students can place a strain on the school budget. Between the first 
grade and high school graduation, students literally spend hundreds 
of hours taking tests of one form or another; and to the extent that it 
is possible, any unnecessary new testing should be discouraged. 

The following sections will describe some of the major aspects of an 
acceptable testing program. Only some of the most prominent types of 
measures will be mentioned, e.g., intelligence tests and achievement 
tests. The many types of measures not mentioned are usually employed 
at the discretion of the individual school, or individual teacher, and at 
times that best fit the particular needs. 

Group Tests of Intelligence. Group tests of intelligence are most 
useful in the lower elementary grades, and they decline in usefulness 
from that point on. (The decline in usefulness is not so much a reflec-
tion on the tests as it is a credit to the usefulness of “competing”
sources of information, particularly that from school grades and from
achievement tests.) The cornerstone of a school-wide testing program
should be the administration of one of the better group tests of intel- 
ligence to all beginning first graders. This will supply the teacher and 
the school with valuable information about what to expect from each 
child. In terms of the total amount of information supplied about a 
child at a particular point in time, this undoubtedly is the most valua- 
ble test the child will ever take. Rather than administer the test the 
first or second day of school, it is best to wait ten days to two weeks.
This will give the child time to “settle down,” learn to follow instruc- 
tions, and get used to the classroom setting. Most group tests of intel- 
ligence can be administered by the teacher in his own classroom. 

In addition to administering a group test of intelligence to begin- 
ning first graders, it is wise to obtain group measures every other year, 
up through at least the seventh or ninth grade. As a very minimum, 
group tests should be given at least at one point in the middle of 
elementary school, say at the beginning of the fourth grade. It is 
hoped it was made quite clear at numerous points in the book that it 
is dangerous to assume that abilities remain absolutely stable over 
long periods of time. Ability tests are good predictors of school prog-
ress during the one or two years following the test administration; but
they often are rather poor predictors of performance five or more years 
later. Consequently, there is no substitute for repeated testing with 
some of the most important types of measures. 

Whether or not group tests of intelligence continue to be used
beyond elementary school largely depends on what other tests are
employed, which is a point that will be considered in a later section. If 
no better measures of “abstract ability” are available, it would be wise 
to apply measures of general ability at two points in high school, most 
probably at the beginning of the ninth and eleventh grades. 

Individual Tests of Intelligence. Nearly all the experts would agree 
that for testing young children the individual measures are preferred. 
Most experienced examiners can obtain a very reliable indication of 
intelligence from even the shy or obstreperous child. Why then are
individual measures not made on all children? The answer is very 
simple: They are far too costly. It takes an expert about half a day to 
administer, score, and interpret an individual test. In terms of cost, 
this would mean about $25 for each child. Few schools are so rich as 
to afford the luxury of individual testing of intelligence. 

Individual testing should be done only for those children who 
make exceptional scores on the group measures, either exceptionally 
high or exceptionally low. These are the children who probably will 
need special attention and instruction, and it is important to carefully 
document the extent to which they are exceptional. Typically it would
be expected that only several children out of a beginning first-grade
class of twenty-five would ever need individual testing. 

In addition to routinely administering individual tests to all those 
children who make exceptional scores on group measures, individual 
measures are frequently employed by school psychologists and guid- 
ance counselors with respect to any problem child. Regardless of 
whatever else may lie behind the problem, intelligence is one of the 
most important things to consider. When it is possible to employ them, 
individual measures are preferred for use with problem children. 

Multifactor Aptitude Batteries. In spite of the fact that there 
definitely are important subdivisions (factors) of intelligence, very 
little use is currently made of multifactor batteries, at least not until 
the eleventh or twelfth grades. At the earlier years of elementary and
secondary school, the measures of general intelligence still dominate 
the scene and will probably continue to for some time to come. 

The major reasons why multifactor batteries are not used more fre- 
quently are (a) good batteries are not available for students below 
the age of about fifteen, (b) the batteries would be too time consum- 
ing and expensive for most schools to employ, and (c) the multitude 
of information obtained from the tests would be difficult for most 
teachers to properly interpret. 

Multifactor batteries are presently used largely for students in the 
last two years of high school. At this level they are preferable to 
measures of general intelligence. The multifactor batteries are helpful
in (a) understanding the problems that students have with particular
school topics, (b) advising students on courses of study in high school, 
(c) vocational guidance, and (d) planning for higher education. 

Comprehensive Achievement Tests. Equally important to the use 
of intelligence tests with primary-grade students is the routine use of 
comprehensive measures of achievement. (As will be remembered 
from Chapter 9, comprehensive measures contain material relating to 
all, or most, of the topics in particular grades.) A comprehensive 
battery should be given to every student, every year, at least up 
through the elementary grades, and preferably on through high school. 
(The battery should be given near the end of the school year. The test 
manual usually states the proper time for testing.) The test results not 
only provide a valuable supplement to teacher-made tests, but they 
also are helpful in determining how well the class as a whole is doing. 
Now that the yearly, or other periodic, application of comprehensive 
achievement tests has become routine in most schools, it is hard to see 
how we previously did without them.

Achievement Tests for Special Topics. Whereas most schools 
periodically apply comprehensive achievement tests, there is con- 
siderable variability among schools in the use of achievement tests for 
special topics. Most widely used are special tests for reading achieve- 
ment in the primary grades. Such special tests are needed because (a) 
some comprehensive achievement tests do not fully cover the range of 
reading skills, and (b) the once-a-year comprehensive tests do not 
come frequently enough to provide the teacher with on-the-spot in-
formation about how well students are progressing.

When the school can afford to do so, it is good practice to administer 
special achievement tests for reading skills several times each year for 
children in at least the first four grades. Also, special achievement tests 
are available for many other topics, e.g., mathematics, and to the 
extent that the school can afford the money and the time to use them, 
they provide very helpful supplements to teacher-made tests. 

Achievement tests for special topics come into their own again in 
secondary school, particularly in the tenth through twelfth grades. At 
these levels the curriculum usually encompasses a wide variety of 
special topics, e.g., biology, and, consequently, it becomes difficult to 
adequately represent all areas of study in one comprehensive measure. 
For such topics as chemistry and biology, special achievement tests 
provide very useful information. 

Diagnostic Achievement Tests. From Chapter 10 it will be remem- 
bered that a diagnostic achievement test is intended to “look inside” 
the child’s scholastic work habits in a particular area of instruction. 
Truly diagnostic tests are almost exclusively limited to reading and 
mathematics, and to the first eight grades. Diagnostic tests would be
used more frequently if better ones were available and if the present
ones were less difficult to administer and score. Most schools use 
diagnostic measures only for those children who have real difficulty 
in mastering either reading or mathematics. In these cases, diagnostic 
measures often provide valuable clues about faulty work habits. 

Interest Inventories. Interest inventories relating to vocational pref- 
erences are seldom regularly administered to all students at any grade 
level. Because interests relating to vocational activities tend to stabilize 
only near the end of high school, interest inventories are primarily used 
for vocational and career guidance of students at that level. In those 
high schools where a guidance counselor is available, interest inven- 
tories are frequently administered to students who seek help in plan- 
ning their futures. In addition to the use of inventories relating to 
vocational interests, teachers often construct or borrow inventories re- 
lating to daily activities to help in making decisions about schoolwork 
and recreation. 

Personality Tests. Only self-report inventories are widely used in 
schools, and schools vary considerably in terms of the amount of use 
made of them. Some schools do not routinely use personality inven- 
tories, and others apply them to all students at regular intervals. At the 
elementary school level, the routine use of one of the problem check- 
lists (of the kind described in Chapter 16) would probably be worth 
the time and expense involved. For students in secondary school, a 
variety of self-report measures are available. At those levels, personality 
inventories are mainly used as part of the diagnosis and treatment of 
students who have special interpersonal and scholastic problems. They 
usually are, and should be, applied and interpreted only by qualified 
school psychologists and guidance counselors. 

Projective techniques often are used by guidance counselors and 
school psychologists with students who give strong indications of 
having severe personal problems. Because they are very time-consum- 
ing techniques to apply, and because they require a highly trained 
examiner, they probably will continue to be applied in only very 
special cases. 

It is hoped that the years ahead will provide economical and valid 
measures of personality that routinely can be applied as part of the 
school testing program. Such measures would add valuably to the 
types that are now routinely used, but until better measures are 
available, we will have to limp along on what we now have. 

Other Measures. Above are listed the major types of instruments 
that are found in school-wide testing programs. In addition, most of 
the other types of measures described in this book are potentially use- 
ful to the individual teacher or as part of a school program. For ex- 
ample, either the school as a whole, or individual teachers, may want
to employ one of the measures of students’ attitudes toward instruc-
tion, peer ratings, or tests of musical aptitude. 

Some of the types of measures described in this book are not fre- 
quently employed in school-wide programs and not frequently used 
by individual teachers, such as measures of motor skills, creativity, and 
some measures of attitudes and personality. But it is important for 
teachers to understand all these measures because they frequently are 
used in educational research. Teachers are, or at least should be, very
much concerned with the results of educational research, and they
cannot understand these results unless they understand the measures
used in research.


Who Does What 


As is the case with many enterprises, it is easy for each of us to 
adopt the misconception that a testing program is started and managed 
by “someone else,” some vaguely defined “expert” who will run the 
whole show. In fact, a thorough program of testing must have its roots 
directly in the classroom, and many people must cooperate to ensure 
that the program is a success. 

It is not always possible to say exactly “who does what” in particular 
testing programs. Job titles do not always convey the specific functions 
of people. For example, although we can talk about the part that 
school psychologists usually play in testing programs, it might not 
apply at all in particular school systems. However, there is enough in
common among different school systems to talk about the roles usually
played by different professional workers.

School Psychologists. School psychologists usually have doctoral 
degrees, either the Ph.D. or the Ed.D. Usually they are well trained 
in general psychology, learning theory, clinical psychology, and, above 
all, in educational and psychological measurement. In a school system, 
they usually are the highest experts available on measurement prob- 
lems. There are not nearly enough school psychologists to fill all the 
needs, and it may be a very long time before enough are available. 
Consequently, it is seldom that a school psychologist will work exclu- 
sively in one school. Rather, he will often operate centrally from the 
school board and be directly responsible to the superintendent of 
schools. 

If a standard testing program is used throughout a school system, the 
school psychologist will have a lot to say about the instruments which 
will be used and how they will be used. If each school originates its 
own testing program, the school psychologist usually will be consulted.
In addition to playing an important part in the management of testing
programs, the school psychologist participates in the diagnosis and
treatment of problem children. He might, for example, be asked to
examine a third-grade boy who is far behind in reading ability. School 
psychologists usually are skilled in the use of individual tests of intel- 
ligence and projective tests of personality, and they often are required 
to use these in diagnosing problem cases. Individual testing of problem 
students and holding conferences with teachers and parents about the 
results occupy much of the school psychologists’ time. In addition, they 
usually are called on for expert advice about any problem relating to
educational measurement.

Guidance Counselors. In some cases the boundary between school 
psychologists and guidance counselors is quite blurred, but there are 
some characteristics that usually distinguish the two. It is rare that 
guidance counselors hold doctoral degrees. Instead they are likely to 
hold master’s degrees in guidance counseling. Many colleges and uni- 
versities now offer special graduate level programs for that purpose. 
Unfortunately, many of the persons who now are called guidance 
counselors either have no advanced degree or do not have an advanced 
degree in guidance counseling. As is true of school psychologists, there 
are not nearly enough well-trained guidance counselors to fill the many 
needs, and it is often the case that some well-meaning, but untrained, 
teacher will assume the title in a particular school. 

In contrast to the school psychologist, the guidance counselor almost 
always works exclusively in one school. Whereas school psychologists 
seldom teach regular courses, ordinarily guidance counselors do 
teach within the school. Usually they are given some reduction in 
teaching load to allow them time to perform their special duties. 

Guidance counselors function much the same within a particular 
school as school psychologists do throughout a school system. Although 
they seldom are truly expert in problems of educational measurement, 
they often have had advanced course work in the area. Within a school,
the guidance counselor is regarded as the local advisor on testing.
Usually he reads about new tests and new developments in testing and
is familiar with sources for obtaining information about tests. If a
uniform testing program is established throughout a school system, the
guidance counselor generally coordinates the testing program in the
school with that in the system as a whole. If there is no uniform pro-
gram throughout the school system, and, if, as is often the case, no
school psychologist is available, the guidance counselor may have
major responsibility for originating and managing a program in a
particular school.

Guidance counselors usually spend considerable time in dealing with
problem children. They generally are skilled to some extent with indi-
vidual tests of intelligence and with projective techniques, and they
use these in diagnosing problems. In elementary school, they often
teach, or supervise the teaching of, special remedial classes. Also, they
often consult with teachers about problem students and have inter- 
views with parents and other interested persons. 

At the high school level, the guidance counselor has somewhat dif- 
ferent functions. He is often the official disciplinarian as well as the 
person who typically works with any other type of problem student. 
In addition, he helps students plan their high school curriculum and 
counsels them about their vocational and future academic plans. For 
these purposes, he often uses interest inventories and tests of special 
aptitudes. 

School Administrators. Principals, school superintendents, and 
other administrators are involved to a greater or lesser degree in test- 
ing programs depending on the extent to which well-trained school 
psychologists and/or guidance counselors are available. If such special- 
ists are available, administrators usually do, and should, rely heavily 
on their advice about matters relating to educational measurement. 
When specialists are not available, it necessarily will be the case that 
administrators will play leading parts in the selection and use of 
standardized tests. Administrators vary in expertness about educational 
measurement all the way from having considerable graduate-level 
training to having had no formal course work on the topic. All school 
administrators should have at least enough grounding in educational 
measurement to understand why certain tests are used in school 
programs and what they are intended to measure. 

Teachers. Even if teachers had no say about how testing programs 
were constituted, they would need to have a good grounding in educa- 
tional measurement. They would need this, not only to effectively 
measure the day-to-day progress of their students in class, but also 
to interpret the results of school-wide testing programs. The individual
teacher is the ultimate recipient and user of tests given in a school- 
wide program, and if he does not understand the purpose and nature 
of the tests, everything goes to waste. 

Actually teachers should be, and usually are, prominently involved in
the establishment of school-wide programs. These days teachers are 
becoming more sophisticated in the technical aspects of educational 
measurement, and they can help make decisions about which tests to 
use and how to use them. In addition, teachers have a better oppor- 
tunity than anyone else to judge how helpful particular tests are in 
making educational decisions. The teacher can tell whether or not a 
remedial reading test helped locate the difficulties of a particular 
child, whether a personality inventory helped find the troubles of 
another, whether a particular achievement test adequately sampled 
mathematics problems, and so on. 

Before selecting particular tests, school psychologists and guidance
counselors usually seek the advice of teachers about the adequacy of
the measures. A good way to do this is to have a committee composed 
of teachers, school administrators, guidance counselors, and school 
psychologists to make decisions about various aspects of a testing 
program. Also, if they are available in the community, it is wise to 
obtain the advice of psychologists and educational specialists from 
colleges and universities. The pooled wisdom of the group will usually 
lead to a well-conceived testing program, and the committee can 
continually look for methods to improve the program. Also, if repre- 
sentatives from all the major professional groups are involved in 
making decisions about a testing program, they will tend to be coopera- 
tive in helping to make the program effective. 


Sources of Information 


When starting a school-wide testing program, where do you obtain 
information about tests? If a particular test is recommended by a 
friend, how do you determine whether or not the test is any good? If 
you would like to examine a particular test, where can you obtain a 
copy? These and other questions relating to sources of information 
about tests will be discussed in this section. 

It is hoped that this book will, in part, serve as a source of informa- 
tion about particular tests. The book is not primarily intended to be a 
reference source for all tests available, nor a source of critical reviews 
for tests. However, many tests have been used in previous pages to 
illustrate principles of educational measurement, and readers might 
want to adopt some of these in their own classroom or school-wide 
testing programs. The author tried to illustrate principles of testing 
with instruments that he considered to be generally good, and other 
tests are described in Appendix D, but those should not be relied 
upon as an infallible guide. Literally, thousands of tests are available; 
therefore, any author is bound to be somewhat limited in the range 
of his acquaintance with particular tests. Rather than rely exclusively 
on textbooks for information about particular tests and kinds of tests, 
it also is wise to consult the sources of information mentioned in the 
following sections. 

Mental Measurements Yearbooks. By far the most valuable sources
of information about tests are the five Mental Measurements Yearbooks
prepared by Buros (15). The Yearbooks contain detailed information 
on thousands of tests covering not only all the major types of instru- 
ments but also very specialized instruments such as tests of Hebrew 
aptitude. The major parts of the Yearbooks are concerned with critical 
reviews of tests by experts. The number of reviewers used for each test 
is determined by how widely the test is employed. The
reviewers give detailed, critical information and opinions about each
test, pointing out specific advantages and disadvantages. Following is 
a portion of the review by Professor John E. Milholland (15, pp. 350- 
351) of the Lorge-Thorndike Intelligence Tests: 


This test is admirable for the clarity with which its objective is stated 
and for the restraint exercised in the claims for what it will do. It is 
frankly labeled an intelligence test, and we are told that it is a test of 
abstract intelligence, defined as “the ability to work with ideas and the 
relationships among ideas.” There is, of course, no precise objective 
criterion for this definition, so one is forced to rely upon indirect evi- 
dence, inspection of the items, and the professional reputation of the 
authors for the assessment of this kind of validity. All three lines of evi-
dence are confirmatory.

The suggestions made in the manual for the use of the results are 
reasonable and practical and do not rely upon exorbitant claims for what 
the test is measuring. The authors recommend administering both the 
verbal and nonverbal batteries in grades for which both are available, 
and state that “the functions are sufficiently similar so that, for most 
pupils, it will be appropriate to average the IQ’s from the two batteries
to yield a single more comprehensive and more reliable estimate of in-
tellectual ability. However, in about 25 per cent of cases, the two forms
will yield IQ’s differing by as much as 15 points. In these cases, the
difference may have practical significance in relation to a pupil’s reading 
level, school achievement, or vocational planning.” 

With the possible exception of the Word Knowledge and Arithmetic 
Reasoning, the subtest titles simply describe the types of items they con- 
tain. This should certainly reduce any temptation to try to interpret sub- 
test scores, and, in keeping with this point of view, the authors present 
no subtest norms. 


Later in his review, Professor Milholland says: 


The examiner’s manuals seem to be especially well adapted for use by 
classroom teachers. They contain directions for administration, sugges- 
tions for using the test, and tables of norms. The two paragraphs explain- 
ing the standard error of measurement should probably be expanded. As 
they stand, these paragraphs might be more confusing than helpful to a
great many teachers.


For each test reviewed, the Yearbooks provide many other useful 
items of information, including (a) test publisher, (b) grade levels for
which the test is appropriate, (c) prices of testing materials, and (d)
any special features. In addition to the reviews, the Yearbooks contain 
many references to research articles and books relating to particular 
tests and to measurement problems in general. Anyone who deals
extensively with tests should obtain a copy of the most recent Mental
Measurements Yearbook. 

Tests in Print. A very useful companion book to the Yearbooks is 
Tests in Print (14), which is essentially a list of all the tests that could 
be found in an extensive search of the Yearbooks, test catalogues, 
professional journals, and other sources. In the list are 2,126 tests 
presently in print and 841 tests which are out of print. The major 
advantage of the book is that it tells you where to obtain critical 
reviews and detailed information about each test. Many of the refer- 
ences are to reviews in the five Yearbooks; others are to professional 
journals, publishers' manuals, and books. In addition to serving as a
master source of references for tests, the book supplies a number of 
kinds of pertinent information about each test including (a) publisher, 
(b) age and grade levels for which appropriate, (c) number and 
types of forms available, (d) names of subtests, and (e) any special 
features. Tests in Print is the place to start to find the tests available
in a particular area, e.g., achievement tests for Latin, and for finding 
detailed information and critical reviews about particular tests. Like 
the Yearbooks, Tests in Print is a “must” for anyone who deals exten- 
sively with standardized tests. 

Test Publishers. Other important sources of information are test 
publishers. The commercial concerns most widely engaged in publish- 
ing tests are listed in Appendix A. Each major publisher has, and 
usually will mail on request, a catalogue of tests which lists all the 
tests available, accompanied by pertinent information such as price, 
grade levels for which test is appropriate, testing time, and number of 
forms that can be had. To learn more about particular tests listed in 
the catalogue, teachers can obtain a “specimen set” of test material and 
a testing manual. The combined cost of these is seldom more than $2. 

By inspecting the specimen set of test materials, teachers can judge 
whether or not the material is appropriate for the measurement prob- 
lem at hand. The test manual usually provides detailed information 
about how the test was constructed, standardized, and validated. Also, 
test manuals usually provide norms, directions for administering and 
scoring, and suggestions for interpreting the results.

It is too much to expect that commercial distributors of tests will be
completely unbiased in describing their wares. They are in the business
of selling tests, and it is quite natural to expect that they will make
their tests sound as attractive as possible. Most testing concerns are
relatively honest in mentioning some of the limitations of their tests,
but it is hard for them to be as dispassionately critical as an expert
reviewer might be. That is why it is essential to consult expert critical
reviews in the Yearbooks and elsewhere before finally deciding to
adopt a particular test.


Test manuals vary greatly in the carefulness with which directions 
are spelled out and in the amount and quality of research evidence 
presented. It is hoped that the principles stated in this book will help 
teachers evaluate the claims and evidence presented in test manuals. 

Professional Journals. Because the Mental Measurements Year- 
books have, on the average, appeared only about every four years, it 
has been difficult to obtain critical reviews and detailed information 
about new tests. The only way to obtain such information about re-
cently developed tests is through research and review articles in pro-
fessional journals in psychology and education. Most teachers will not
have the inclination to regularly peruse the technical fare of these
professional journals, but if they want up-to-date information about 
new tests, that is the only major resource. Research evidence on tests 
is reported in a number of journals, particularly in the Journal of 
Consulting Psychology, Journal of Counseling, Personnel and Guidance 
Journal, Educational and Psychological Measurement, and Journal of 
Educational Research. Two master guides can be used in the search 
for articles relating to particular tests. For psychological journals,
Psychological Abstracts provides a rather complete listing of articles in 
different areas, and a short summary is provided for each. For journals 
in the field of education, the Educational Index provides a very wide 
listing of journal articles. Using these sources, one could, for example,
find references to articles dealing with research on achievement tests 
or intelligence tests. 


Some of the Most Important Uses of Tests 


Throughout the book it has been emphasized that tests should be 
used only if they help in making educational decisions. The word 
“decision” is used broadly in relation to changes of curriculum for all 
or some of the students, changes in plans by parents about the study 
practices and future education of their children, grade placement and 
promotion of students, changes in self-conceptions of students about 
their own abilities and personality characteristics, and many others. If 
tests are employed only because it is fashionable to do so, and if test 
results actually do not influence educational decisions, then it is foolish 
of the school to waste the time and money involved. In each chapter, 
efforts have been made to show how particular kinds of tests can be 
helpful in making particular kinds of decisions, and it would be re- 
dundant to rehash all that material here; but it might be helpful to 
summarize some of the most important kinds of decisions for which 
tests are helpful. 

Administrative Decisions. In every school the principal's office
needs test results to help in making decisions about students. Typical
of the decisions that must be made is that of the grade placement for 
a transfer student. The student may have moved from a far-away state 
in which the schools operate quite differently from those in the new 
region. In what grade should the student be placed? Does he need 
remedial work in some topics, or is he capable of moving on to 
advanced work in other topics? Another type of administrative decision 
is that of determining whether or not an apparently slow learner should 
be removed from the regular school curriculum and given special 
schooling. It would be very difficult to make such decisions without 
the information obtained from teacher-made tests and commercially 
distributed tests of achievement and aptitude. 

Decisions in Counseling. Tests are the mainstay of those who 
specialize in counseling and guidance work. If a child is having dif- 
ficulty in keeping up with his class, is it because of a lack of aptitude 
or because of other factors? Should the parents be consulted, and if so, 
what should they be told? Tests of intelligence, diagnostic achievement 
tests, and results from teacher-made tests would help in making those 
decisions. 

In high school, many students seek counseling because they are 
having personal problems or because they are unsure of what they 
should study now and what they will want to do after they graduate. 
Personality tests and tests of interests and special aptitudes help both 
the counselor and the student in discussing the particular problem. 

Decisions in Research. Whole schools and individual teachers are 
constantly conducting informal experiments about the effectiveness of 
different approaches to education. Mrs. Brown tries a new approach 
to teaching algebra. Does it work? The only way to tell is to compare 
teacher-made and standardized achievement test results of students 
who learn by the new method with students who learn by the old 
method. 

In addition to the many kinds of informal experiments that teachers 
conduct on their own instructional practices, psychologists and educa- 
tional specialists conduct many systematic studies in school settings. 
They want to make decisions about the effectiveness of teaching 
machines in the learning of foreign languages, the differences in
attitudes toward higher education engendered by different types of 
curricula, the impact of the classroom on the personal and social 
development of children, and many others. It would be all but impossi- 
ble to validly conduct such research were it not for the availability of 
many types of tests.

Classroom Decisions. Perhaps most important of all, many types
of tests are directly helpful to teachers in making day-to-day decisions
about what goes on in the classroom. Is the class behind in learning
numbers skills, and should proportionately more time be devoted to
that topic? Should Lewis Martin be given extra homework in spelling? 
Should Joe Stevens be allowed to skip the fifth grade? Would it be 
wise to suggest to Anne Jackson’s parents that she do remedial work 
in summer school? In these and in countless other decisions teachers 
are aided materially by the results from their own tests and stand- 
ardized tests provided by outside agencies. 


A Philosophy of Measurement 


In this final section it is important to consider the proper attitudes 
to hold toward the use of tests in making educational decisions. 
Although standardized tests are becoming widely accepted by both 
educational experts and the public at large, there still are many who 
criticize their use. The major criticisms are of two kinds. The first is 
that the wide use of tests leads to “unfair” educational practices, and 
the second is that tests are poor measurement devices. 

Regarding the first point, critics will say that tests serve to “brand” 
students, unfairly segregate them into ability groupings, restrict the 
ranges within which students are allowed to grow and change, and 
encourage unhealthy feelings of superiority and inferiority in children. 
These criticisms are potentially correct, but it must be firmly kept in 
mind that it is not the tests, per se, that bring about such unfortunate 
consequences but rather the improper use of test results. Tests are 
intended to supply information, and it is not the fault of the tests or 
the people who construct them if that information is unwisely used. 
When important decisions are to be made, it would be foolish to ignore 
any worthwhile source of information; and tests usually rank high in 
the importance of the information which they supply. 

Part of the feeling that tests are “unfair” springs from the cherished 
concept in our society that “all men are created equal.” We all are (or 
should be) equal in the social and legal sense, but we are not equal in 
our abilities and personality characteristics. To ignore such differences
would deny help to those students who need special attention;
would cause chaos in planning the future education and careers of 
students; and would deprive teachers, parents, and students themselves 
of information which they badly need. 

The criticisms that tests are poor measurement devices take several 
forms. One of these is to challenge the ability of any paper-and-pencil 
device to get at the “real” understanding of school subjects. These
critics often point to specific items on tests that are concerned with 
trivia or that are so poorly formulated that even the most knowledgea-
ble student is likely to get the wrong answer. Of course, it is a chal-
lenge to measure the more important aspects of learning, but many ex-
amples have been shown in this book of how it can be done.




Critics also point to those instances in which aptitude tests (such as 
tests of intelligence) greatly underestimate how well students will 
perform. The affection that all of us hold for the underdog makes us 
feel good when we hear of the lad who is judged to be below average 
by an intelligence test but who manages to graduate from college with 
a Phi Beta Kappa key. But such instances are quite rare; and all con- 
cerned should realize the improbability of such events before time, 
money, and hope are invested in them. 

Tests are neither perfect measures of current achievement and per- 
sonality, nor perfect predictors of later adjustment and accomplish- 
ment; but they are by far the best indicators available. Whenever the 
validity of tests is compared with what would be relied on if tests 
were not available, e.g., impressions of teachers, the tests clearly do 
better. Tests are here to stay, and they are growing in importance each 
year. The only proper attitude is to want to make tests better and 
better and better. 


Suggested Additional Readings 


Hill, G. E. and Scott, J. D. School testing program inventory. Athens, Ohio:
Center for Educational Service, Ohio Univer., 1960.

Kent Area Guidance Council. A proposed 12-year testing program. Columbus,
Ohio: State Department of Education, 1959.

Ross, C. C. and Stanley, J. C. Measurement in today’s schools. (3rd ed.) Engle-
wood Cliffs, N.J.: Prentice-Hall, 1954, chaps. 11-16.

Thorndike, R. L. and Hagen, Elizabeth. Measurement and evaluation in psychol-
ogy and education. (2nd ed.) New York: Wiley, 1961, chap. 16.


Appendix A


Major Publishers of Psychological and Educational Tests
(Test catalogues sent on request)


Acorn Publishing Co., Rockville Centre, Long Island, N.Y.
Bureau of Educational Measurements, Kansas State Teachers College, Emporia,
Kans.
Bureau of Educational Research and Service, State University of Iowa, Iowa City,
Iowa.
Bureau of Publications, Teachers College, Columbia University, New York 27,
N.Y.
California Test Bureau, Del Monte Research Park, Monterey, Calif.
Consulting Psychologists Press, Inc., 577 College Ave., Palo Alto, Calif.
Cooperative Test Division, Educational Testing Service, Princeton, N.J.
Educational Test Bureau, 720 Washington Ave., S. E., Minneapolis, Minn.
Educational Testing Service, Princeton, N.J.
Gregory (C. A.) Co., Test Division of The Bobbs-Merrill Company, Inc., 1720
East 38th St., Indianapolis 6, Ind.
Harcourt, Brace & World, Inc., New York, N.Y.
Houghton Mifflin Company, 2 Park St., Boston 7, Mass.
Ohio Scholarship Tests, State Department of Education, 751 Northwest Blvd.,
Columbus 15, Ohio.
Psychological Corporation, 304 East 45th St., New York 17, N.Y.
Psychometric Affiliates, Box 1625, Chicago 90, Ill.
Public School Publishing Co., Test Division of The Bobbs-Merrill Company, Inc.,
1720 East 38th St., Indianapolis 6, Ind.
Scholastic Testing Service, Inc., 3774 West Devon Ave., Chicago 45, Ill.
Science Research Associates, Inc., 9 East Erie St., Chicago 11, Ill.
Sheridan Supply Co., P. O. Box 837, Beverly Hills, Calif.
Stoelting (C. H.) Co., North Holman Ave., Chicago 24, Ill.
University of London Press Ltd., Little Paul's House, Warwick Square, London.
Western Psychological Services, Box 775, Beverly Hills, Calif.




Appendix B


Proportions of the Area in Various Sections of the Normal Distribution*


Standard score z (x/σ),     Area        Area
plus and minus              between     beyond
(1)                         (2)         (3)
0.00                        .0000       1.0000
0.05                        .0398       .9602
0.10                        .0796       .9204
0.15                        .1192       .8808
0.20                        .1586       .8414
0.25                        .1974       .8026
0.30                        .2358       .7642
0.35                        .2736       .7264
0.40                        .3108       .6892
0.45                        .3472       .6528
0.50                        .3830       .6170
0.55                        .4176       .5824
0.60                        .4514       .5486
0.65                        .4844       .5156
0.70                        .5160       .4840
0.75                        .5468       .4532
0.80                        .5762       .4238
0.85                        .6046       .3954
0.90                        .6318       .3682
0.95                        .6578       .3422
1.00                        .6826       .3174
1.05                        .7062       .2938
1.10                        .7286       .2714
1.15                        .7498       .2502
1.20                        .7698       .2302
1.25                        .7888       .2112
1.30                        .8064       .1936
1.35                        .8230       .1770
1.40                        .8384       .1616
1.45                        .8530       .1470
1.50                        .8664       .1336
1.55                        .8788       .1212
1.60                        .8904       .1096
1.65                        .9010       .0990
1.70                        .9108       .0892
1.75                        .9198       .0802
1.80                        .9282       .0718
1.85                        .9356       .0644
1.90                        .9426       .0574
1.95                        .9488       .0512
2.00                        .9544       .0456
2.05                        .9596       .0404
2.10                        .9642       .0358
2.15                        .9684       .0316
2.20                        .9722       .0278
2.25                        .9756       .0244
2.30                        .9786       .0214
2.35                        .9812       .0188
2.40                        .9836       .0164
2.45                        .9858       .0142
2.50                        .9876       .0124
2.55                        .9892       .0108
2.60                        .9906       .0094
2.65                        .9920       .0080
2.70                        .9930       .0070
2.80                        .9948       .0052
2.90                        .9962       .0038
3.00                        .9973       .0027
3.10                        .99806      .00194
3.20                        .99862      .00138
3.40                        .99932      .00068
3.60                        .99968      .00032
3.80                        .999856     .000144
4.00                        .9999366    .0000634
4.50                        .9999932    .0000068
5.00                        .99999942   .00000058
6.00                        .999999998  .000000002


* If, for example, in a normal distribution of test scores, you want to esti-
mate the number of persons who make scores between plus one standard devia-
tion of the mean and minus one standard deviation of the mean, you would
look opposite 1.00 in the first column at the proportion in the second column.
There it is seen that the proportion is .6826 or, in other words, approximately
68 per cent. This means that approximately 32 per cent of the individuals
make scores either greater than one standard deviation above the mean or less
than one standard deviation below the mean. If you want to determine the
proportions of people who lie within or beyond certain standard score units
above the mean only or below the mean only, the proportions in columns 2 and
3 should be halved.
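
Readers with access to a computer can recompute the proportions in this table rather than look them up. The short Python sketch below is an illustration only: the function names area_between and area_beyond are introduced here and are not part of the original text. It relies on the error function in Python's standard math module. Because the printed table appears to have been built by doubling four-place one-tail values, the computed figures may differ from the table by one unit in the fourth decimal place.

```python
import math

def area_between(z):
    """Proportion of a normal distribution within plus and minus z standard deviations."""
    return math.erf(z / math.sqrt(2))

def area_beyond(z):
    """Proportion farther than z standard deviations from the mean (both tails together)."""
    return math.erfc(z / math.sqrt(2))

# Print a few rows in the same layout as the table: z, area between, area beyond.
for z in (0.50, 1.00, 2.00, 3.00):
    print(f"{z:.2f}  {area_between(z):.4f}  {area_beyond(z):.4f}")
```

Halving either result gives the proportion on one side of the mean only, as the table note describes.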


Appendix C


Statistical Appendix


1. Corrections for attenuation. Because of the measurement error 
inherent in all tests, correlations among tests are less than they would 
otherwise be. To the extent to which correlations are lowered in this 
manner, it is said that they are “attenuated” by unreliability. The more 
reliable the tests, the less the attenuation; the less reliable the tests, the 
more the attenuation. According to the theory, completely unreliable 
tests could not possibly correlate other than zero with any other meas- 
ures (except for the departures from zero correlation that would occur 
as a function of sampling errors). If in examining a correlation the re- 
liability of one or both tests is known, the theory of measurement error 
allows us to estimate what the correlation would be if the reliability 
of one or both of the measures were increased.

A simple formula allows us to estimate what the correlation would 
be if measurement error were entirely removed from one of the two 
measures, that is, if one of the measures were made perfectly reliable. 


The formula is

$$\hat{r}_{12} = \frac{r_{12}}{\sqrt{r_{11}}}$$

where $r_{12}$ = obtained correlation between tests 1 and 2
$r_{11}$ = reliability of test 1
$\hat{r}_{12}$ = estimated correlation between tests 1 and 2 if test 1 (but
not test 2) were made perfectly reliable


To illustrate the use of the formula, assume that a vocabulary test
(test 1) correlates .48 with a mathematics test (test 2). A prior study
of the reliability of test 1, say one comparing alternate forms of the test,
produces a reliability coefficient of .64. At this point assume that we
have no information about the reliability of test 2. Correction for the
attenuation due to the measurement error in test 1 is made as follows:


$$\hat{r}_{12} = \frac{.48}{\sqrt{.64}} = \frac{.48}{.8} = .60$$


The estimate is that if test 1 were made perfectly reliable the correla- 
tion of .48 would, in another study, increase to .60. 

In the above formula we only considered the unreliability in one 
test. Of course, the second test also would not be perfectly reliable. 
Let us say in this case that the reliability is known to be .81. The fol- 
lowing formula allows us to estimate the correlation that would be ob- 
tained if both tests were made perfectly reliable: 


$$\hat{r}_{12} = \frac{r_{12}}{\sqrt{r_{11}} \sqrt{r_{22}}}$$


Substituting the figures from the example we find: 


$$\hat{r}_{12} = \frac{.48}{\sqrt{.64} \sqrt{.81}} = \frac{.48}{.8 \times .9} = \frac{.48}{.72} = .67$$


If both tests were made perfectly reliable, the formula estimates that 
the correlation between the two tests would, in a subsequent study, be 
.67 instead of .48. 
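
The two corrections for attenuation can be checked with a few lines of code. The following Python sketch is illustrative only: the function name correct_for_attenuation and its argument names are assumptions made for this example. Leaving out the second reliability gives the single correction, and supplying it gives the double correction.

```python
import math

def correct_for_attenuation(r12, r11, r22=None):
    """Estimate the correlation if test 1 (and, optionally, test 2) were perfectly reliable.

    r12 is the obtained correlation, r11 and r22 are the reliabilities.
    """
    denominator = math.sqrt(r11)
    if r22 is not None:
        # Double correction: remove the unreliability of both tests.
        denominator *= math.sqrt(r22)
    return r12 / denominator

single = correct_for_attenuation(0.48, 0.64)        # vocabulary test made perfectly reliable
double = correct_for_attenuation(0.48, 0.64, 0.81)  # both tests made perfectly reliable
print(round(single, 2), round(double, 2))
```

With the figures from the examples above, the two estimates round to .60 and .67.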

These formulas are useful for estimating the “true” relationships between
variables, that is, the relationships that would be found if measurement
error were not attenuating them. One place in which the formulas are
useful is in examining the correlation between a predictor test t and its
criterion c. Say in this case t is a test to select college freshmen and c is
grades in college. Because of the measurement error in c, the correlation
of t with c underestimates how well the test actually works. Consequently,
it is justifiable to make the correction for the unreliability in c by dividing
the correlation between t and c by the square root of the reliability of
c. However, in this instance, it would not be as directly meaningful to
make the double correction for attenuation by correcting for both t
and c. That correction would offer only a promissory note as to how
well the test would work if it were made perfectly reliable.

Of course, perfect reliability is only a handy fiction, and to make
estimates on the basis of that assumption is only useful to guide our 
thinking relative to the usefulness of tests. A more down to earth prob- 
lem is to estimate how much a correlation would increase if the re- 
liability were increased by any particular amount. A formula for doing 
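this is as follows:

$$\hat{r}_{12} = r_{12} \sqrt{\frac{r'_{11} \, r'_{22}}{r_{11} \, r_{22}}}$$

where $r_{11}$ and $r_{22}$ = original reliabilities of the two tests
$r'_{11}$ and $r'_{22}$ = new reliabilities
$r_{12}$ = original correlation
$\hat{r}_{12}$ = estimated correlation after the reliabilities are increased
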
Assume (a) that the original correlation between the two tests is .36,
(b) that the original reliabilities of the two tests are .64 and .49, and
(c) that after improving the two tests (for example, by making them
longer), the reliabilities increase, respectively, to .81 and .64. Then
the estimate of the correlation that would be obtained from the more
reliable measures is as follows:

$$\hat{r}_{12} = .36 \sqrt{\frac{.81 \times .64}{.64 \times .49}} = .36 \times \frac{.9}{.7} = .46$$


The formula also can be used to estimate how the correlation would 
change if both of the reliabilities were decreased (for example, by 
shortening the tests) or if one of the reliabilities were increased and 
the other were decreased. 
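
As a rough check on estimates of this kind, the calculation can be written as a small Python function. This is a sketch under stated assumptions: the function name and argument names are invented for the illustration, and the formula used is the standard one in which the original correlation is multiplied by the square root of the ratio of the product of the new reliabilities to the product of the original reliabilities.

```python
import math

def correlation_after_reliability_change(r12, old_r11, old_r22, new_r11, new_r22):
    """Estimate the correlation between two tests after their reliabilities change."""
    return r12 * math.sqrt((new_r11 * new_r22) / (old_r11 * old_r22))

# Figures from the example: correlation .36, reliabilities .64 and .49
# raised to .81 and .64 (for instance, by lengthening the tests).
estimate = correlation_after_reliability_change(0.36, 0.64, 0.49, 0.81, 0.64)
print(round(estimate, 2))
```

Setting both new reliabilities to 1.00 reproduces the double correction for attenuation, and swapping the old and new values shows the drop to be expected when tests are shortened.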

2. Correlation and regression formulas. In the text, the basic 
formula for the correlation coefficient was stated in terms of standard
scores as follows:

$$r_{12} = \frac{\sum z_1 z_2}{N}$$

where $z_1$ = standard scores on one test
$z_2$ = standard scores on another test
$N$ = number of people in the study


In using this formula, one simply multiplies standard scores for each
person on the two tests, sums these for all persons in the study, and
divides by the number of persons in the study.

Although the correlation coefficient ultimately concerns the relation-
ship between two sets of standard scores, it is more convenient to com-
pute the correlation from either deviation scores or raw scores. Essen-
tially what these formulas do is to convert scores to standard-score
form in the context of the calculations rather than require a prior
transformation to standard-score form. The following formula can be
used to compute the correlation from deviation scores:


$$r_{12} = \frac{\sum x_1 x_2}{\sqrt{\sum x_1^2 \sum x_2^2}}$$

where $x_1$ = deviation scores on one test
$x_2$ = deviation scores on another test


With this formula, one first multiplies the deviation scores for each 
person on the two tests and sums these over the number of people in- 
volved in the study. Then one squares the scores for each person on 
test 1 and sums the squared scores over people; the same is done for 
scores on the second test. These quantities are then inserted into their 
proper places in the formula. 

Actually, if an automatic calculator is available, it is easiest to com- 
pute correlations from raw scores, using the following formula: 


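$$r_{12} = \frac{N \sum X_1 X_2 - \sum X_1 \sum X_2}{\sqrt{\left[N \sum X_1^2 - \left(\sum X_1\right)^2\right] \left[N \sum X_2^2 - \left(\sum X_2\right)^2\right]}}$$

where $X_1$ = raw scores on one test
$X_2$ = raw scores on another test
$N$ = number of people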


Although the above formula may look complex, actually it is a straight- 
forward extension of the more simple appearing standard-score for- 
mula. Again it should be emphasized that all these formulae supply 
the same numerical results. They are different computational ap- 
proaches to obtaining the same statistic. 
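
The claim that the three computational approaches give the same result is easy to verify. The Python sketch below is only an illustration: the function names and the two short lists of scores are hypothetical and were made up for this example. Note that the standard-score version uses standard deviations computed by dividing by N, which is what the formula with N in the denominator requires.

```python
import math

def r_from_standard_scores(x1, x2):
    """Mean of the products of standard scores (standard deviations use N)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    s1 = math.sqrt(sum((x - m1) ** 2 for x in x1) / n)
    s2 = math.sqrt(sum((x - m2) ** 2 for x in x2) / n)
    return sum(((a - m1) / s1) * ((b - m2) / s2) for a, b in zip(x1, x2)) / n

def r_from_deviation_scores(x1, x2):
    """Sum of deviation-score products over the root of the product of sums of squares."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    d1 = [x - m1 for x in x1]
    d2 = [x - m2 for x in x2]
    cross = sum(a * b for a, b in zip(d1, d2))
    return cross / math.sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))

def r_from_raw_scores(x1, x2):
    """Raw-score formula, convenient when sums and sums of squares are accumulated directly."""
    n = len(x1)
    s1, s2 = sum(x1), sum(x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    return (n * s12 - s1 * s2) / math.sqrt((n * s11 - s1 ** 2) * (n * s22 - s2 ** 2))

# Hypothetical raw scores for five students on two tests.
test_1 = [10, 12, 15, 18, 20]
test_2 = [8, 11, 13, 17, 16]
print(r_from_standard_scores(test_1, test_2))
print(r_from_deviation_scores(test_1, test_2))
print(r_from_raw_scores(test_1, test_2))
```

Apart from small rounding differences in the last decimal places, the three printed values agree.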

After the correlation coefficient is computed, it is possible to obtain 
a best-fit line that most effectively summarizes the relationship be- 
tween the two variables being studied. Examples of such best-fit lines 
are shown in Figures 4-1 and 4-2. If variables are expressed in stand- 
ard-score form, the equation for determining the best-fit line is as 
follows: 


$$\hat{z}_2 = r_{12} z_1$$

where $z_1$ = scores on the predictor variable
$\hat{z}_2$ = estimated scores on the variable being predicted
$r_{12}$ = correlation between the two variables


Suppose that the correlation between two tests is .50. To estimate 
scores on one test from the other, standard scores are multiplied by
.50. For example, if a student has a standard score of 2.00 on one test,
the estimate is that he has a standard score of 1.00 on the other test. 
If a student has a standard score of —1.00 on one test, the estimate is 
that he has a standard score of —.50 on the other test. Because when 
the best-fit line is plotted for sets of standard scores the line must go 
through the origin, it is necessary to obtain only one other point to 
draw the best-fit line. 

If the variables are expressed in deviation-score form rather than 
standard-score form, the equation for the best-fit line is as follows: 


$$\hat{x}_2 = r_{12} \frac{\sigma_2}{\sigma_1} x_1$$

where $\hat{x}_2$ = estimates of deviation scores
$x_1$ = deviation scores on the predictor variable
$\sigma_1$ and $\sigma_2$ = standard deviations of the variables


If variables are plotted in terms of raw scores, the regression equa- 
tion is as follows: 


$$\hat{X}_2 = r_{12} \frac{\sigma_2}{\sigma_1} (X_1 - M_1) + M_2$$

where $\hat{X}_2$ = estimates of raw scores
$X_1$ = raw scores on the predictor variable
$M_1$ and $M_2$ = means of the variables
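
The regression equations can be tried out with a brief Python sketch. It is offered only as an illustration: the function names are invented here, and the means and standard deviations in the raw-score example (50 and 10 on the predictor, 100 and 15 on the predicted variable) are hypothetical values chosen for convenience.

```python
def predict_standard_score(z1, r12):
    """Best-fit estimate of a standard score on test 2 from a standard score on test 1."""
    return r12 * z1

def predict_raw_score(x1, r12, mean_1, sd_1, mean_2, sd_2):
    """Best-fit estimate of a raw score on test 2 from a raw score on test 1."""
    return r12 * (sd_2 / sd_1) * (x1 - mean_1) + mean_2

# Standard-score examples from the text, with a correlation of .50.
print(predict_standard_score(2.00, 0.50))
print(predict_standard_score(-1.00, 0.50))

# Hypothetical raw-score example: a score two standard deviations above
# the mean of the predictor (70 where the mean is 50 and the SD is 10).
print(predict_raw_score(70, 0.50, 50, 10, 100, 15))
```

In the raw-score case the estimate lies one standard deviation above the mean of the predicted variable, which matches the standard-score result for the same student.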

3. Internal-consistency measures of reliability. When an alternate 
form of a test is not available, very useful estimates of the reliability 
can be obtained from formulas concerning the internal consistency of 
the test items. If a test is reliable, all the items should tend to measure 
the same thing and should correlate positively with one another. The 
higher items correlate with one another, the more reliable the test. To 
say it another way, if the items within a test correlate highly with one
another, the whole test should correlate highly with an alternate form.
By making several reasonable assumptions, good estimates can be
made of how highly a test would correlate with an alternate form. The
basic formula is as follows:

$$r_{11} = \frac{n}{n - 1} \cdot \frac{\sigma_1^2 - \sum pq}{\sigma_1^2}$$

where $r_{11}$ = reliability of the test
$n$ = number of items in the test
$\sigma_1^2$ = squared standard deviation of the test
$p$ = proportion of students passing each item
$q$ = proportion of students failing each item


In computing the reliability the major computational chore is to multi- 
ply the proportion of students passing each item by the proportion of 
students failing each item. For example, if .40 of the students pass the 
first item and .60 fail the first item, the product is .24. These products
are computed and summed over all the test items. After this quantity 
is obtained, it is a simple matter to obtain the other terms for the 
formula. 

The formula above is called the Kuder-Richardson formula number 
20. For a more complete discussion of this and other formulas for esti- 
mating the reliability from the internal consistency of tests, see Guil- 
ford (38). It should be emphasized that what KR-20 does is to esti- 
mate the correlation between an existing test and a hypothetical alter- 
nate form. Usually the estimate is very good. When alternate forms are 
available, the actual correlation usually corresponds closely to the 
estimate given by KR-20. 
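
The KR-20 computation is tedious by hand but short in code. The Python sketch below is an illustration under stated assumptions: the function name kr20 and the small table of item scores (five students, four items, 1 for pass and 0 for fail) are hypothetical, and the test variance is computed by dividing by the number of students.

```python
def kr20(item_scores):
    """Kuder-Richardson formula 20 for a table of 0/1 item scores, one row per student."""
    n_students = len(item_scores)
    n_items = len(item_scores[0])

    # Squared standard deviation of total test scores (dividing by the number of students).
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_students
    variance = sum((t - mean_total) ** 2 for t in totals) / n_students

    # Sum over items of (proportion passing) times (proportion failing).
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_students
        sum_pq += p * (1 - p)

    return (n_items / (n_items - 1)) * (variance - sum_pq) / variance

# Hypothetical results for five students on a four-item test.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 2))
```

For these made-up data the total scores have a variance of 2.0 and the pq products sum to .80, so the estimate works out to .80.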

Some very rapid estimates of reliability are available which, al- 
though not usually as accurate as KR-20, provide useful information. 
These are discussed by Guilford. 

4. Measurement error: effects on score distributions. It will be re-
called that measurement error has two effects on scores. First, it intro-
duces a source of bias, scores above the mean being biased upward
and scores below the mean being biased downward. Second, it intro-
duces a zone of uncertainty, or error, for each score. It is helpful to
obtain unbiased estimates of “true” scores and to assert confidence
zones about those estimates. Unbiased estimates of true scores can be
obtained as follows:

$$x'_1 = r_{11} x_1$$

where $x'_1$ = estimated true scores
$x_1$ = deviation scores on the test
$r_{11}$ = test reliability

If an individual has a test deviation score of 10 and the reliability of 
the test is .90, the best estimate of his “true” score is 9. If another in- 
dividual has a test deviation score of —20, his estimated “true” score 
is —18. What the formula above does is to regress scores back toward 
the mean. The further out scores are in either direction, the more (in 
an absolute sense) they are pulled back toward the mean. 

It should be clear that the formula above does not change the rela- 
tive ordering of students. The top student with respect to obtained 
scores will remain the top with respect to estimated true scores, and 
the bottom student on obtained scores will remain there on estimated 
true scores. Because the relative ordering of students is not changed, 
it rightly can be asked why the formula serves any useful purpose. It
primarily is useful for asserting the center of confidence bands, which 
we now will consider. 

Although the formula above provides an estimate of “true” scores, it 
gives no indication of the amount of error entailed in the estimate. If 
we gave people many forms of the same test, their scores would vary 
somewhat from day to day. It would be expected that the scores for 
each would range about some typical value, that typical value being 
the estimated true score for the student. How widely scores range 
about the true score is an indication of the amount of measurement 
error. One way to gauge the amount of error would be to compute the
standard deviation of obtained scores for each person. The larger the
standard deviation, the more measurement error there would be. The
standard deviation would be zero only if the test were perfectly re-
liable. Of course, it is not possible to measure the amount of error
exactly in this way because we never administer numerous forms of
the same test to people. However, by making some reasonable assump-
tions, an estimate can be made of the standard deviation of scores that
would be obtained. The formula is as follows:


$$\sigma_{meas} = \sigma_1 \sqrt{1 - r_{11}}$$

where $\sigma_{meas}$ = standard error of measurement
$\sigma_1$ = test standard deviation
$r_{11}$ = reliability of test 1

The standard error of measurement (SEM) is a special kind of stand-
ard deviation which indicates the amount of error in a test due to un-
reliability. Illustrating the computations, if a test has a standard devia-
tion of 10 and a reliability of .90, the computations would be as
follows:

$$\sigma_{meas} = 10 \sqrt{1 - .90} = 10 \times .316 = 3.16$$


In other words, if we actually gave a person numerous comparable 
forms of the test, the expected standard deviation of scores (SEM) 
would be 3.16 score units. 

The formulas for estimating true scores and for obtaining the SEM
can be brought together in an illustration of how confidence bands are 
set. Previously it was shown that with a test reliability of .90 the esti- 
mated true scores corresponding to obtained scores of 10 and —20, 
respectively, are 9 and —18. These estimated true scores would serve 
as the centers of confidence bands. In the previous example of a test 
with a standard deviation of 10 and a reliability of .90, the SEM was
3.16. By marking off points above and below estimated true scores 
corresponding to numbers of SEMs, odds can be set regarding the 
probabilities of obtained scores on comparable forms exceeding speci-
fied limits. Conventionally we work with a confidence band extending
two SEMs above and two SEMs below estimated true scores. For the 
person with an estimated true score of 9, the confidence band extends 
from 2.68 to 15.32. For the person with an estimated true score of —18, 
the confidence band would extend from —24.32 to —11.68. Because it 
is not possible to obtain fractional scores on most tests, these numbers 
would be rounded, and the confidence bands would extend from 3 to 
15 and from —24 to —12, respectively. 

Because the SEM is a special kind of standard deviation, odds cor- 
responding to confidence zones can be found in the proportions of 
area under various regions of the normal curve shown in Appendix B. 
There it will be found that approximately 95 per cent of the cases lie 
between plus two and minus two standard deviations about the mean. 
In other words, by asserting the confidence band as ranging from two 
SEMs below to two SEMs above the estimated true score, we develop 
a zone which allows us to feel “95 per cent sure.” In other words, if the 
individual were administered 100 comparable forms of the test, the ex- 
pectation is that only 5 times out of 100 would obtained scores be 
either greater or less than the limits of the confidence band. 
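
The steps for setting a confidence band can be gathered into a few small functions. The Python sketch below is illustrative only: the function names are assumptions made for this example, and the band is taken as two SEMs on either side of the estimated true score, as in the discussion above.

```python
import math

def estimated_true_score(deviation_score, reliability):
    """Regress an obtained deviation score toward the mean."""
    return reliability * deviation_score

def standard_error_of_measurement(test_sd, reliability):
    """Expected standard deviation of one person's scores over comparable forms."""
    return test_sd * math.sqrt(1 - reliability)

def confidence_band(deviation_score, test_sd, reliability, n_sems=2):
    """Lower and upper limits centered on the estimated true score."""
    center = estimated_true_score(deviation_score, reliability)
    half_width = n_sems * standard_error_of_measurement(test_sd, reliability)
    return center - half_width, center + half_width

# The two students from the example: deviation scores of 10 and -20
# on a test with a standard deviation of 10 and a reliability of .90.
for score in (10, -20):
    low, high = confidence_band(score, 10, 0.90)
    print(score, round(low, 2), round(high, 2))
```

The limits agree with those worked out above to two decimal places; changing n_sems to 1 gives the narrower band that covers roughly 68 per cent of the cases.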

Although it has been emphasized that obtained scores are biased 
and it is best to regress scores toward the group mean, in practice it is 
sometimes difficult to know exactly how this is to be done. The prob- 
lem is that sometimes we are not sure what mean should be the pivotal 
point for regressing scores. For example, if someone tells me that he 
tested a child and found an IQ of 110, what should I use as the mean 
toward which the score is to be regressed? Should I use the over-all 
mean of 100 found in the test standardization? Yes, that is the best 
thing to do if you have no other information about the child. But sup- 
pose I learn that the child comes from a school in which the average 
IQ is 115. Should I still regress the score toward a mean of 100? No, 
now that I have obtained additional information about the child, I 
should not blindly follow the original statistical rules for regressing 
scores. More sensible would be to consider the score as probably lower 
than the true score rather than higher. Admittedly this becomes a 
rather complex problem, but there is no way of avoiding the fact that 
additional information changes the estimates one makes about meas- 
urement error. Two general rules can be stated. First, if you have no 
other information about a child, regress the obtained score toward the 
mean of the standardization sample, e.g., toward an IQ of 100. Second, if you
do have additional information about a child which allocates him to a 
definite subgroup of the population, temper your judgment of the
need to regress the score toward the mean of the standardization 
sample. It would be difficult to give exact statistical rules for how to 
do this, but the additional information about the child should be con- 
sidered in making judgments about his score. 

Fortunately, the knotty conceptual problems above are not met in 
classroom situations. There it is eminently sensible to regress scores 
toward the mean of all the students in a particular class. There it is 
safe to bet that, on any kind of test, high scores tend to be biased 
upward and low scores biased downward. There the teacher has a 
well-defined group in which the measurement error concepts and 
formulas clearly apply. 

5. Median—computational procedures. As was stated in the text, 
strictly speaking the median is defined only for an unusual circum- 
stance. By its definition, the median is the point on the test continuum 
which separates the top 50 per cent from the bottom 50 per cent of 
the students. Obviously then a particular score, e.g., 18, could be the 
median only if half of the students score 19 or higher and half of the 
students score 17 or lower. In other words, it would not be possible to 
obtain the median in the strict sense if anyone scores exactly at the 
median, 18 in the example above. Usually it will be the case that nu-
merous students will make the score corresponding to the median. An
example was given in the text in which declaring a “whole” score to be
the median would be misleading. Consequently, it is necessary to transform the
“whole” score corresponding to the median to a fractional score which 
more nearly fits the definition of the measure. The computational pro- 
cedures can be illustrated with the following distribution of test scores. 


Score Number of students 
24 1 
23 3 
22 2 
21 4 
20 3 
19 1 
18 6 
17 4 
16 4 
15 3 
14 1 
13 3 
12 1 
11 1 
10 1 


The computation of the median begins by determining the number of 
students in the distribution, which in the example above is 38. This is
then divided by 2, giving 19. One then counts up from the bottom to 
find the score made by the 19th student, which in this case is 18. This 
might be used as the approximate median, but it is only an approxima- 
tion because six students make scores of 18. 

The best way to consider the problem is to think of a score of 18 as 
lying in a band from 17.5 to 18.5 and to think of the median as lying 
somewhere in that band. How far up in the band the median lies de- 
pends on the proportion of the students making the score (18) that 
must be counted in order to include 50 per cent (here, nineteen) of 
the students. In this case it is necessary to count only one of the six 
students who make scores of 18. Consequently, one-sixth is added to
the lower bound of the category (17.5). This gives a median of 17.67. 
Admittedly those computations would not be employed unless the 
problem were important. However, because teachers will see frequent 
mention of the median with respect to commercially distributed tests 
and research reports, they need to know how the measure is com- 
puted. 
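
The counting procedure above can be sketched in Python. The function name `interpolated_median` and the dictionary layout are illustrative choices, not part of the text, and the sketch assumes whole-number scores, so that each score occupies a band one unit wide.

```python
def interpolated_median(frequencies):
    """Median of a distribution given as {whole score: number of students}.

    Each score is treated as a band from score - 0.5 to score + 0.5,
    and the median is placed proportionally within the band that
    contains the middle student.
    """
    total = sum(frequencies.values())
    half = total / 2                      # 50 per cent of the students
    counted = 0                           # students counted so far from the bottom
    for score in sorted(frequencies):
        in_band = frequencies[score]
        if counted + in_band >= half:
            needed = half - counted       # students still needed from this band
            return (score - 0.5) + needed / in_band
        counted += in_band
    raise ValueError("the distribution contains no students")


# The distribution of 38 test scores shown above.
scores = {24: 1, 23: 3, 22: 2, 21: 4, 20: 3, 19: 1, 18: 6, 17: 4,
          16: 4, 15: 3, 14: 1, 13: 3, 12: 1, 11: 1, 10: 1}

median = interpolated_median(scores)
print(round(median, 2))  # 17.67
```

Eighteen students score below 18, so only one of the six students in the 18 band is needed to reach the nineteenth student, which gives 17.5 + 1/6.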

6. Sampling error of correlation coefficients. In the text it was em- 
phasized that sampling error is involved in obtaining any correlation 
coefficient. This is because correlations usually are obtained from only 
a sample of all the students that conceivably could be studied. For
example, if the correlation between an intelligence test and an
achievement test is obtained on 50 students, this is only a tiny fraction
of all the students in the country on whom the correlation conceivably
could be computed. Consequently, it is not safe to regard any correlation
as an exact value but rather as lying in a confidence band extending above
and below the obtained value. The width of the confidence band shrinks as
the number of students in the study grows (roughly in inverse proportion
to the square root of that number). If only 10 
students are in the study, it is hard to place any confidence at all in 
the correlation. If 100 students are in the study, the zone of uncertainty 
about the correlation is much less. If 10,000 students are used to com- 
pute the correlation coefficient and these actually have been sampled 
from the population, the zone of uncertainty is so small that, for all 
practical purposes, the obtained correlation can be taken as almost 
identical to the value that would be obtained if the whole population 
were studied. 

How to obtain confidence bands can be illustrated with the situa- 
tion in which the population correlation actually is zero. That is, if all 
the students in the country were measured in order to compute the 
correlation, the actual value would be found to be zero. What would 
happen if we sampled only 100 students instead of measuring the 
whole population? Would the correlation in that sample be exactly
zero? If we drew different samples of 100 students, would all the
correlations be exactly zero? No, they would range about zero, some of
them being positive and some of them being negative. This would happen
because of the chance factors involved in drawing samples of students.
Expected in this instance would be an approximate normal distribution
of correlations, with the mean at zero. How widely sample correlations
ranged about zero would depend on the number of students in each
sample. With only 10 students, the standard deviation of sample
correlations would be very large; with 100 students, the standard
deviation would be smaller; with 10,000 students the standard
deviation would be so small that it could be overlooked altogether.
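
The behavior described above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the two test scores are drawn as independent normal variables (so the population correlation is zero), and the function names, the number of samples, and the random seed are arbitrary choices that do not come from the text.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def sd_of_sample_correlations(n_students, n_samples=1000, seed=0):
    """Standard deviation of r over many samples from a population
    in which the true correlation is zero (assumed: normal scores)."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_samples):
        xs = [rng.gauss(0, 1) for _ in range(n_students)]
        ys = [rng.gauss(0, 1) for _ in range(n_students)]
        correlations.append(pearson_r(xs, ys))
    mean_r = sum(correlations) / n_samples
    return math.sqrt(sum((r - mean_r) ** 2 for r in correlations) / n_samples)


sd_small = sd_of_sample_correlations(10)    # samples of 10 students
sd_large = sd_of_sample_correlations(101)   # samples of 101 students
print(sd_small, sd_large)
```

The sample correlations scatter far more widely about zero when each sample holds 10 students than when it holds about 100.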

A standard deviation of sample values is called a standard error. 
After the standard error is obtained, it can be used to set confidence 
zones for interpreting correlations. First, we will show how a standard 
error is obtained and then show how it is used to set confidence zones. 
A standard error for the correlation coefficient can be computed as
follows:

\[ \sigma_r = \frac{1}{\sqrt{N - 1}} \]

where $\sigma_r$ = standard error of a correlation coefficient
      $N$ = number of students being studied

Suppose that a correlation is obtained on 101 students (to pick a
convenient number). The standard error would be:

\[ \sigma_r = \frac{1}{\sqrt{101 - 1}} = .10 \]

In other words, with an N of 101, the estimated standard deviation of
sample correlations (in a population where the true value is zero) is
.10.
Suppose now that with an N of 101 a correlation of .40 is obtained. A
confidence zone could be set by marking off correlation points so many
standard errors above and so many standard errors below the obtained
correlation. Frequently we employ a confidence zone of two standard
errors above and below the obtained correlation. In Appendix B it is
shown that approximately 95 per cent of the cases lie within two
standard deviations (standard errors here) of the mean. Here the
confidence band would extend from correlation values of .20 to .60. The
meaning is that we can feel 95 per cent sure that the real correlation
lies somewhere between .20 and .60.

Suppose with a sample of 101 students the obtained correlation is only
.15. Then the confidence band extends from -.05 to .35. In other words,
in different samples, 5 times out of 100 the correlation might be less
than -.05 or greater than .35. When the confidence band crosses zero, it
means that it is not safe to conclude that the real correlation is other
than zero. In that instance we say that the correlation is “not
statistically significant” and withhold judgment about the correlation
until more students are studied.
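
The two worked examples can be reproduced with a short Python sketch. It uses the approximate standard error described above (one divided by the square root of N - 1) and a band of two standard errors on each side; the function names are illustrative and are not taken from the text.

```python
import math


def standard_error_of_r(n_students):
    """Approximate standard error of a correlation: 1 / sqrt(N - 1)."""
    return 1 / math.sqrt(n_students - 1)


def confidence_band(r, n_students, n_errors=2):
    """Band extending n_errors standard errors below and above r."""
    se = standard_error_of_r(n_students)
    return r - n_errors * se, r + n_errors * se


def is_statistically_significant(r, n_students):
    """True when the two-standard-error band does not cross zero."""
    low, high = confidence_band(r, n_students)
    return low > 0 or high < 0


for r in (0.40, 0.15):
    low, high = confidence_band(r, 101)
    print(r, round(low, 2), round(high, 2), is_statistically_significant(r, 101))
```

With 101 students, the correlation of .40 gives a band of about .20 to .60, which does not cross zero. The correlation of .15 gives a band of about -.05 to .35, which does cross zero and so would be reported as not statistically significant.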

Here we have illustrated the sampling error of correlations only in 
the simplest case. More complex procedures are required to study the 
sampling error of the difference between two correlations obtained on 
different samples, of the difference between two correlations obtained 
with the same sample, and others. Also, the reader should be warned 
that the formula given above for the standard error of the correlation 
is only an approximation. Although it works fairly well in many in- 
stances, more exact formulas are available. For more exact estimates of 
standard errors and for a more complete discussion of how to take ac- 
count of sampling error, the reader should consult texts on statistics. 
Some good texts on statistics are listed in the Suggested Additional 
Readings at the ends of Chapters 3 and 4. 

More important for teachers than to know all the statistical pro- 
cedures concerning the sampling error of correlations is to adopt appro- 
priate attitudes toward correlations reported in the research literature 
and correlations computed in school settings. First, it is important to 
realize that there is some sampling error connected with any correla- 
tion coefficient. Second, it is wise to be suspicious of correlations un- 
less they are based on a relatively large number of students. Usually, 
unless at least one hundred students are being studied, there is so 
much sampling error that the real value of the correlation is highly in 
doubt. Third, it is important to realize that statistical procedures are 
available for asserting confidence bands for correlations, and when the 
need arises, the necessary statistics can be obtained from available 
texts. 

7. Spearman-Brown prophecy formulas. As was mentioned at a 
number of places in the text, when the number of items in a test is in- 
creased, the reliability tends to increase. By making some reasonable 
assumptions, a formula can be obtained to estimate how much the re- 
liability will increase as a function of the increase in test length. The 
formula is as follows: 


\[ r_{nn} = \frac{n\,r_{11}}{1 + (n - 1)\,r_{11}} \]

where $r_{nn}$ = estimated reliability of a test n times as long as the
                 original test
      $r_{11}$ = reliability of the original test
      $n$ = number of times the test is lengthened




If the reliability of a twenty-item test is .70, and if forty similar items 
are added to the original twenty, the estimate of the reliability of the 
sixty-item test is as follows: 


\[ r_{nn} = \frac{3(.70)}{1 + (3 - 1)(.70)} = \frac{2.1}{2.4} = .88 \]

Whereas the twenty-item test has a reliability of .70, the estimate is 
that the reliability will increase to .88 by adding forty items (by making
the eventual test three times as long as the original). 

This general formula can be used both to estimate how much the 
reliability will increase if more items are added to a test and also to 
estimate how many items will be required to reach a specified level of 
reliability. Thus, if a test has a reliability of .75, and a reliability of .90
is needed, these two values can be entered in the formula (substituting
.90 for the estimated reliability of the lengthened test), and the
solution will estimate the number of times the test 
must be lengthened. 

It should be pointed out that in the formula n need not be a whole 
number. It can be a fractional number. For example, the formula can 
be used to estimate the increase in reliability obtained from adding 
twenty more items to a forty-item test. In this case n would be 1.5. 

The formula also can be used to estimate the reliability of a short- 
ened test. In some instances it is helpful to have a shorter version of a 
test to use for rough screening purposes. If the reliability of the longer 
test is known, it is helpful to have an estimate of the reliability of the 
shorter version. This can be accomplished by making n the ratio of the 
number of items in the shorter test to the number of items in the longer 
test. For example, if the number of items in the longer test is forty and 
the number of items in the shorter test is twenty, then an n of .5 is 
entered in the formula.
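
The uses of the general formula described above (lengthening a test, finding the required length, and shortening a test) can be sketched in Python. The function names are illustrative. The second function is simply the general formula solved algebraically for n, a rearrangement that is not printed in the text.

```python
def spearman_brown(r_original, n):
    """Estimated reliability of a test n times as long as the original.

    n may be fractional: 1.5 for adding twenty items to a forty-item
    test, or 0.5 for cutting a forty-item test to twenty items.
    """
    return n * r_original / (1 + (n - 1) * r_original)


def lengthening_needed(r_original, r_target):
    """Number of times the test must be lengthened to reach r_target
    (the general formula solved for n)."""
    return r_target * (1 - r_original) / (r_original * (1 - r_target))


# Twenty-item test with reliability .70, lengthened to sixty items (n = 3).
print(spearman_brown(0.70, 3))

# Reliability of .75 now, reliability of .90 wanted.
print(lengthening_needed(0.75, 0.90))

# Forty-item test shortened to twenty items (n = .5); the starting
# reliability of .80 is a hypothetical value chosen for illustration.
print(spearman_brown(0.80, 0.5))
```

The first call gives 2.1 / 2.4, or about .88, as in the worked example. The second indicates that the test with a reliability of .75 would have to be made about three times as long to reach .90.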

The general formula takes on a simple appearance in the special 
case where it is necessary to estimate the reliability of a test doubled 
in length:

\[ r_{11} = \frac{2\,r_{\frac{1}{2}\frac{1}{2}}}{1 + r_{\frac{1}{2}\frac{1}{2}}} \]

where $r_{\frac{1}{2}\frac{1}{2}}$ = reliability of half the eventual items
      $r_{11}$ = reliability of the total (twice as many) items

This version of the formula is particularly useful in studying the
reliability of a test by the split-half method. As was mentioned in the
text, one way to measure reliability is to split the items within a test
into two parts and correlate scores obtained on the two parts. The 
most popular procedure is to obtain separate scores on the odd- and 
even-numbered items. The two scores are then correlated. However, 
a correction must be made to obtain the reliability of the whole test, 
not just of the half tests. The formula above provides the necessary 
correction. For example, if the correlation of the split halves is .80, the 
correction is as follows: 


\[ r_{11} = \frac{2(.80)}{1 + .80} = \frac{1.60}{1.80} = .89 \]


What the formula does is to estimate how much the total collection of 
items would correlate with another collection of similar items of the 
same size. 
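
The odd-even procedure and its correction can be sketched as follows. The item scores are hypothetical data invented for illustration (six students, six items, 1 = correct and 0 = incorrect), and the function names are likewise illustrative.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def correct_split_half(r_halves):
    """Doubled-length correction applied to the correlation of two halves."""
    return 2 * r_halves / (1 + r_halves)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per student, in test order."""
    odd_scores = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even_scores = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    return correct_split_half(pearson_r(odd_scores, even_scores))


# Hypothetical item scores for six students on a six-item test.
item_scores = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
]

print(correct_split_half(0.80))             # the worked example: about .89
print(split_half_reliability(item_scores))
```

So small a group would never be used in practice to estimate reliability; the data serve only to show the order of the computations.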

A caution should be heeded in applying the above formulas. The 
formulas assume that the items to be added (or taken away) are similar
to those in the original test (or those in the remaining collection 
after the test is shortened). If the items to be added or taken away are 
grossly different from the original or remaining (in a shortened test) 
items, the formulas give misleading results. For example, in lengthen- 
ing a test, this would occur if the new items were either much easier 
or much harder than the original ones, or if the new items concerned 
different factors of ability or personality. Although it is useful to keep 
these assumptions in mind, it seldom is the case that they are grossly 
violated in actual work with tests. Consequently, the formulas usually 
supply very good estimates of the effect on reliability of either 
lengthening or shortening tests. 

8. Standard deviation—computational approaches. In the text, the 
basic formula for the standard deviation was given as follows: 


\[ \sigma = \sqrt{\frac{\sum x^2}{N}} \]

where $\sigma$ = standard deviation of the test
      $x$ = deviation scores on the test
      $N$ = number of students


To compute the standard deviation, the first step is to subtract the 
mean from all raw scores. The resulting deviation scores are then 
squared and summed. The sum is divided by the number of scores.
The square root of this quantity is the standard deviation. 

Although it is easier to discuss the standard deviation in terms of
deviation scores, if an automatic calculator is available, it is easier to
compute the standard deviation from raw scores as follows:

\[ \sigma = \frac{1}{N}\sqrt{N\sum X^2 - \left(\sum X\right)^2} \]

where $X$ = raw scores on the test
      $N$ = number of students


Several tips are helpful in thinking about and in computing the 
standard deviation. First, the standard deviation obtained from the 
raw-score formula above is not changed if a constant is either added to 
or subtracted from all the scores. Consequently, it is permissible to add 
or subtract a constant from all the numbers before entering them in 
the raw-score formula. Suppose that the lowest test score is 40, then 
the computation of the standard deviation would be simplified by first 
subtracting 40 from each score. The resulting numbers would be easier 
to work with in calculating the standard deviation. This rule can also 
be used to get rid of negative numbers. If points are subtracted for 
making certain types of errors, some test scores are likely to be nega- 
tive. These negative numbers can be avoided by adding a positive 
quantity equal to the largest negative score to each of the scores. This 
will not change the standard deviation. 

Another tip is that if all the scores are multiplied by a constant, the 
standard deviation is multiplied by the same constant. For example, if 
all the scores are multiplied by two, the standard deviation obtained 
from the raw-score formula will be twice as large as that which would 
have been obtained from the original numbers. This fact allows us to 
throw out decimal points in the computation of the standard deviation. 
For example, if we have scores like 2.3, 1.6, and 3.8, these can be 
converted to 23, 16, and 38. The obtained standard deviation can then
be divided by 10. 

Of course, there is no point in obtaining the standard deviation of 
a set of standard scores. By definition, a set of standard scores has a 
standard deviation of 1.00. 
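
The computational tips in this section can be verified with a short Python sketch. The raw-score function uses a common algebraic rearrangement of the deviation-score formula; that particular form is an assumption made here and may differ from the version printed in the text. Except for 2.3, 1.6, and 3.8, the scores are hypothetical values chosen for illustration.

```python
import math


def sd_from_deviations(scores):
    """Square root of the mean squared deviation from the mean."""
    n = len(scores)
    mean = sum(scores) / n
    return math.sqrt(sum((x - mean) ** 2 for x in scores) / n)


def sd_from_raw_scores(scores):
    """Same quantity from raw scores (assumed form):
    sqrt(N * sum(X^2) - (sum X)^2) / N."""
    n = len(scores)
    sum_x = sum(scores)
    sum_x2 = sum(x * x for x in scores)
    return math.sqrt(n * sum_x2 - sum_x ** 2) / n


scores = [40, 43, 47, 50, 55]            # hypothetical scores; the lowest is 40

# Both approaches give the same standard deviation.
print(sd_from_deviations(scores), sd_from_raw_scores(scores))

# Subtracting a constant (here the lowest score) changes nothing.
shifted = [x - 40 for x in scores]
print(sd_from_raw_scores(shifted))

# Multiplying by 10 removes the decimal points; dividing the result by 10
# restores the standard deviation of the original numbers.
decimals = [2.3, 1.6, 3.8]
whole = [23, 16, 38]
print(sd_from_raw_scores(whole) / 10, sd_from_deviations(decimals))
```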

9. Transformations of objective test results to point scales. In order 
to provide a uniform basis for combining results from objective tests 
with those from essay tests, it was suggested in the text that a point 
scale be used. Illustrations were given of the use of a five-point scale 
for grading essay examinations, term papers, and others. Such a five- 
point scale can easily be applied to all evaluational material except for 
objective test results. In grading objective tests it is necessary to specify 
grade levels in terms of percentages of items correct. For example, an
illustration was given where 85 per cent of the items correct was
considered as an A grade, 72 per cent correct as a B grade, etc.

The following figure illustrates how percentage-correct grades on 
objective tests can be transformed to a point scale: 


[Figure: graph for transforming per cent of items correct on an objective
test (horizontal axis, marked at 25%, 50%, 60%, 72%, 85%, and 100%, with
grade levels D, C, B, and A) to the point scale (vertical axis, 1.0 to
5.0).]

To construct such a figure, first mark off on the baseline the per- 
centages of items that correspond to various grade levels on the objec- 
tive test. On the five-point scale, 1.0 means a totally failing grade. Cor- 
responding to that on the objective test would be the per cent of items 
correct that a student could obtain purely by guessing, which is one 
divided by the number of alternatives for each item. In the figure it is
assumed that multiple-choice questions are used with four alternative 
answers for each question. Consequently, a percentage correct of 25 
should correspond to a score of 1.0 on the five-point scale. Sometimes 
one finds a percentage-correct score of less than that which would be 
expected by chance, in which case it is probably wisest to give that 
student the lowest possible score on the point scale, here 1.0. 

At the upper end of the objective test continuum, 100 per cent cor- 
rect logically corresponds to the highest possible score on the point 
scale, in this example 5.0. It is important to note that in the figure, 
distances between grade levels are not all equal. For example, C and B 
are separated by twelve units (3.0, 3.1, etc., to 4.1), but B and A are 
separated by only three units (4.2, 4.3, and 4.4). To translate
percentage scores to point scores accurately, the correct number of
units on the point-score continuum should be made to separate the
grade levels. For example, the distance on the vertical scale of the
graph from B to A should be divided into three equal segments. The
lower bound of the first segment would be 4.2, the lower bound of the
second segment 4.3, and the lower bound of the third segment would
be 4.4. Similarly, the distance from C to B would be divided into
twelve equal segments.

After the graph is completed, it is a simple matter to translate per- 
centage-correct scores directly to a point scale. For example, it can be 
seen that a percentage correct score of 55 corresponds to a point score 
of 2.5.

Although it may seem like a lot of work to construct a graph for 
translating percentage-correct scores to point scores, once the graph is 
constructed it can be used over and over as long as the same stand- 
ards of grading are used on both scales. Of course, if after some ex- 
perience with the graph it is decided to change grading standards on 
one or both scales, it would be necessary to construct a new graph. 
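
Reading values from such a graph amounts to straight-line interpolation between the marked grade levels, which can be sketched in Python. The anchor pairs below are assumptions inferred from the description: 25 per cent is set at 1.0 and 100 per cent at 5.0 as stated, C at 3.0, B at 4.2, and A at 4.5 follow from the units described, and D at 2.0 (50 per cent) is inferred from the example in which 55 per cent corresponds to 2.5. The names are illustrative.

```python
# (per cent correct, point score) pairs; see the assumptions stated above.
ANCHORS = [(25, 1.0), (50, 2.0), (60, 3.0), (72, 4.2), (85, 4.5), (100, 5.0)]


def percent_to_points(percent, anchors=ANCHORS):
    """Translate a percentage-correct score to the point scale by
    interpolating between neighboring grade levels."""
    lowest_percent, lowest_points = anchors[0]
    highest_percent, highest_points = anchors[-1]
    if percent <= lowest_percent:        # at or below the chance level
        return lowest_points
    if percent >= highest_percent:
        return highest_points
    for (p_low, s_low), (p_high, s_high) in zip(anchors, anchors[1:]):
        if p_low <= percent <= p_high:
            fraction = (percent - p_low) / (p_high - p_low)
            return s_low + fraction * (s_high - s_low)


print(percent_to_points(55))   # 2.5
print(percent_to_points(20))   # 1.0
```

If grading standards change on either scale, only the list of anchor pairs needs to be revised, which corresponds to drawing a new graph.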

10. Transformations of score distributions. A problem that is often 
encountered in using psychological tests is that of transforming an ob- 
tained set of raw scores to a set with a particular mean and standard 
deviation. For example, it might be found that the mean of the ob- 
tained raw test scores is 40 and that the standard deviation is 5. In 
order to compare scores on the test with scores on another test, or in 
order to place the scores in an easily interpretable form, it might be 
desirable to transform the raw scores in such a way that the new 
scores have a mean of 50 and a standard deviation of 10. Transforma- 
tions of this kind can be performed with the following formula: 


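\[ X_t = \frac{\sigma_t}{\sigma_o} X_o - \left( \frac{\sigma_t}{\sigma_o} M_o - M_t \right) \]

where $X_t$ = scores on the transformed scale
      $X_o$ = scores on the obtained scale: raw scores
      $M_t$, $M_o$ = means of $X_t$ and $X_o$ respectively
      $\sigma_t$, $\sigma_o$ = standard deviations of $X_t$ and $X_o$ respectively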


The formula can be applied to the problem illustrated above as
follows:

\[ X_t = \frac{10}{5} X_o - \left( \frac{10}{5}\,40 - 50 \right) = 2X_o - 30 \]


By this transformation a raw score of 40 would be transformed to a 
score of 50, and a raw score of 25 would be transformed to a score of 
20. Because the formula is a linear transformation, it does not change
the shape of the score distribution. 
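
The transformation can be written as a short Python sketch. The function and argument names are illustrative; the computation is the linear formula above, with the slope equal to the ratio of the desired to the obtained standard deviation.

```python
def make_linear_transform(mean_obtained, sd_obtained, mean_desired, sd_desired):
    """Return a function that converts raw scores to a scale with the
    desired mean and standard deviation."""
    slope = sd_desired / sd_obtained
    intercept = mean_desired - slope * mean_obtained

    def transform(raw_score):
        return slope * raw_score + intercept

    return transform


# Raw scores with mean 40 and standard deviation 5, converted to a scale
# with mean 50 and standard deviation 10 (that is, 2X - 30).
to_new_scale = make_linear_transform(40, 5, 50, 10)

print(to_new_scale(40))  # 50.0
print(to_new_scale(25))  # 20.0
```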


Appendix D


Commercially Distributed Tests


Section 1: Comprehensive Achievement Test Batteries 


California Achievement Tests, 1957 Edition 

California Test Bureau 

Levels: Lower primary, grades 1 and 2 (90-110 min.) 
Upper primary, grades 3 and 4 (125-140 min.) 
Elementary, grades 4 to 6 (145-165 min.) 
Junior high level, grades 7 to 9 (180-190 min.) 


The test reports two scores in each of the three basic skill areas of 
reading, arithmetic, and language. Although the manual describes 
methods for obtaining diagnostic information about pupils, such in- 
formation is based on a relatively small number of items in many cases. 
At the primary and elementary levels, the test provides a good cover- 
age of the three basic skill areas. At the junior high level, the test does 
not provide enough information about achievement in content areas. 
Although items with respect to content areas are included, scores are 
not obtainable for different content areas. Because of the emphasis on 
core skills rather than content areas, the test is recommended mainly 
for students at the primary and elementary levels. 


Essential High School Content Battery 
Harcourt, Brace, & World, Inc. 

Levels: grades 9 to 13 

Testing time: 200-225 min. 


The battery covers four fields: mathematics, science, social studies, 
and English. In general, the tests appear to have been carefully designed
and constructed. Although over-all scores on the test should 
provide good indications of the progress of students, some of the sub- 
test reliabilities are rather low, and it is hazardous to seek diagnostic 
information from differences in scores within the test. Because of the
test content, it probably is a more useful measure for students in
general or college preparatory curricula than for students in technical
and commercial curricula.


Iowa Tests of Basic Skills 
Houghton Mifflin Company 
Levels: grades 3 to 9 
Testing time: 280-325 min. 


This is a very thorough battery of tests. Content areas are not covered; 
rather the tests are aimed at the core skills of reading, language, arith- 
metic, and study skills. The battery provides fifteen scores: vocabulary 
(1), language (5), reading comprehension (1), study skills (4), arith- 
metic (3), and total score. Reliabilities of all subtests are good. An un- 
usual feature is that norms are provided for beginning, middle, and 
end of the school year periods. The manuals provide very clear instruc- 
tions for administering and using the test. Apparently the tests were 
very carefully designed and constructed. Unless there is a need to test 
for content areas, the battery provides an excellent measure of core
skills.


The Iowa Tests of Educational Development 
Science Research Associates 

Levels: grades 9 to 13 

Testing time: 459-480 min. 


The battery provides ten scores: understanding of basic social concepts, 
general background in the natural sciences, correctness and appro- 
priateness of expression, ability to do quantitative thinking, ability to 
interpret reading materials in the social sciences, ability to interpret 
reading material in the natural sciences, ability to interpret literary 
materials, subtotal score, and uses of sources of information. Unques- 
tionably the battery provides excellent measures of achievement at 
the secondary level. Tests were carefully designed and composed. 
Norms are based on large, representative samples of students. Manuals 
are clearly written and provide much information useful to teachers. 
The test publisher provides a scoring service which not only gives
results for individual pupils but also gives statistical summaries of
results from each school. The battery exemplifies achievement
measurement at its best.


Metropolitan Achievement Tests, 1959 Edition 
Harcourt, Brace, & World, Inc. 

Levels: Primary I, grades 1.5 to 2.5 (100 min.)
Primary II, grades 2 to 3.5 (115 min.)
Elementary, grades 3 to 4 (175 min.)
Intermediate, grades 5 to 6 (280 min.)
Advanced, grades 7 to 9 (290 min.)


The content of this battery is outlined in detail in Chapter 9. At the 
younger levels the test primarily concerns core skills; at higher levels 
tests also are included for content areas. The test can be recommended 
on many points including (a) careful design of content and construc- 
tion of items, (b) clear and frank manuals, and (c) practicality of 
administration and scoring. 


Sequential Tests of Educational Progress (STEP) 
Cooperative Test Division 
Educational Testing Service 
Levels: Level 4, grades 4 to 6 
Level 3, grades 7 to 9 
Level 2, grades 10 to 12 
Level 1, grades 13 to 14 
Testing time: 450-500 min. 


At each level the battery contains seven tests: (1) reading, (2) writ- 
ing, (3) mathematics, (4) science, (5) social studies, (6) listening, 
and (7) essay. All the tests with the exception of the essay are com- 
posed of multiple-choice items. In the essay test, the student is asked 
to write a composition on a specified topic. The essay is scored by the 
classroom teacher.

The most noteworthy feature of the STEP is that the test items are 
aimed more at the over-all goals of instruction rather than at the 
mastery of particular topics. The items principally concern how well 
students can use their school training to seek answers and to solve 
problems. Some of the items are very cleverly composed. Whereas on 
one hand it can be argued that, by aiming at the major end products of 
education, the STEP is more uniformly fair to students in different 
schools; on the other hand, it may be hard for some schools to see how
their instruction is directly related to the items. By emphasizing end 
products of education rather than obvious course content, the STEP is 
a significant departure from the other major comprehensive batteries. 
Considerable experience will be required to determine whether this 
new emphasis will be widely accepted. Aside from the nature of the 
item content, the STEP shares many of the features of other major 
achievement test batteries, including careful standardization, high re- 
liability, and detailed reporting of norms. 


SRA Achievement Series 

Science Research Associates 

Levels: Grades 2 to 4 (95-125 min.) 
Grades 4 to 6 (335-445 min.) 
Grades 6 to 9 (300-375 min.) 




The battery provides measures of vocabulary, reading comprehension, 
language, arithmetic, and, at the higher levels, study skills. At the two 
upper levels, tests are long and provide broad coverage of material 
relating to core areas of instruction. One noteworthy feature is that on 
the test for grades 2 to 4, part of the reading comprehension material 
concerns concepts essential to reading, which is much like the content 
found on reading-readiness tests. The tests at each age level are some- 
what more difficult than those found on other achievement test bat- 
teries. Consequently, they will be more appealing to schools that have 
above average students, but they would serve rather poorly to provide 
diagnostic information about slow learners. The items apparently were 
very carefully constructed. Reliabilities of individual tests are good. 
Because of the generally high correlations among subtests, most of the 
information from the test is given in one total score. Manuals for the 
tests are very clear and detailed. 


Stanford Achievement Tests 
Harcourt, Brace, & World, Inc. 
Levels: Primary, grades 1.9 to 3.5 
Elementary, grades 3.0 to 4.9 
Intermediate, grades 5 to 6 
Advanced, grades 7 to 9 
Testing time: 80-123 min. 


In terms of content and item type this is one of the more conservative 
achievement batteries. At all levels, reading, spelling, and arithmetic 
are tested. Language skills are measured in all tests except the Pri- 
mary. Social studies, science, and study skills are measured in the In- 
termediate and Advanced batteries. All the tests are well constructed, 
and they have been improved during a series of revisions. The test
manual is very good. Some may feel that the tests for content areas are
too heavily oriented toward simple factual information.


Section 2: Reading Achievement Tests 


Durrell-Sullivan Reading Capacity and Achievement Tests 
Harcourt, Brace, & World, Inc. 
Levels: Primary, grades 2.5 to 4.5 
Intermediate, grades 3 to 6 
Testing time: 45 min. 


Five scores are obtained: (1) word meaning, (2) paragraph meaning, 
(3) spelling, (4) written recall, and (5) total. A noteworthy feature 
of the test is that part of the materials are given orally by the teacher, 
and the remainder is read by the student. This provides information 
about discrepancies between ability to comprehend oral and written
language. In general this is a good test that is primarily useful for
measuring over-all achievement in reading but also provides some 
diagnostic clues about difficulties of particular students. 


Gates Basic Reading Tests 

Bureau of Publications 

Teachers College, Columbia University 
Levels: grades 3 to 8 

Testing time: 70-80 min. 


In the battery are five subtests: (1) reading to appreciate general sig- 
nificance, (2) reading to understand precise directions, (3) reading 
to note details, (4) reading vocabulary, and (5) level of comprehen- 
sion. The battery generally is well constructed and standardized. In- 
structions for administering and interpreting the battery are simple 
and quite clear. 


Gates Primary Reading Tests 

Bureau of Publications 

Teachers College, Columbia University 
Testing time: 25-30 min. 


The battery contains three types of tests: (1) word recognition, (2) 
sentence reading, and (3) paragraph reading. The tests are carefully 
constructed and highly reliable. The manual provides detailed instruc- 
tions for administering and interpreting the tests and gives excellent 
suggestions to teachers for remedial training of students with reading 
difficulties. 


Iowa Silent Reading Test: New Edition, Revised
Harcourt, Brace, & World, Inc. 
Levels: Elementary, grades 4 to 8 
Advanced, grades 9 to 13 
Testing time: 50-60 min. 


At both levels the following scores are obtained: (1) rate of compre- 
hension, (2) directed reading, (3) word meaning, (4) paragraph 
comprehension, (5) sentence meaning, and (6) location of informa- 
tion. In the Advanced battery a seventh test is poetry comprehension. 
Although in general the tests are good, they are all speeded; therefore, 
speed of reading and comprehension are emphasized. The tests prob- 
ably would give a faulty picture of the performance of a student who 
reads well but slowly. Otherwise the tests are well constructed and 
standardized. 




Kelley-Green Reading Comprehension Test 
Harcourt, Brace, & World, Inc. 

Levels: grades 9 to 13 

Testing time: 65-75 min. 


The test obtains scores for four types of reading skills: (1) selecting 
the central idea, (2) reading carefully and skimming for details, (3) 
drawing inferences from what is read, and (4) remembering details. 
Good norms are provided for high school levels. 


Nelson-Denny Reading Test, Revised Edition 
Houghton Mifflin Company 

Levels: grades 9 to 12, adult 

Testing time: 35-40 min.


The test provides four scores: (1) vocabulary, (2) paragraph compre- 
hension, (3) reading rate, and (4) total. A noteworthy feature of the 
test is that it is easily administered and scored. The test is too brief to 
provide truly diagnostic information about reading difficulties. Its pri- 
mary use is for a relatively quick appraisal of over-all reading skill. 


Reading Comprehension: Cooperative English Test 
Cooperative Test Division 

Educational Testing Service 

Levels: grades 9 to 12 

Testing time: 40-45 min. 


The test provides scores for (1) vocabulary, (2) speed of comprehen- 
sion, (3) level of comprehension, and (4) total. Extensive research 
with this test demonstrates that it is a good predictor of school achieve- 
ment. One of the best features of the test is that it attempts to measure 
subtle aspects of reading comprehension that are not measured by 
some other tests. This is one of the best reading achievement tests for
high school students.


Section 3: Group Tests of General Intelligence 


Chicago Nonverbal Examination 
Psychological Corporation 
Levels: age 7 to adult 

Testing time: 40 min. 


The test consists entirely of pictorial and symbolic material that re- 
quires little, if any, language usage. It can be administered either with 
oral instruction or, for those who have a severe language deficit, the
test can be administered entirely by pantomime. Although tests of this
type are not the best measures of general intelligence for most pur- 
poses, they have an important place with specific types of students. 
They are particularly useful with children who have a severe language 
handicap, such as the deaf and children who recently have immigrated 
from other countries. Also, nonlanguage tests of this type are useful 
with children in this country who have led culturally impoverished 
lives. 


Cooperative School and College Ability Tests (SCAT) 
Cooperative Test Division 

Educational Testing Service 

Levels: grades 4 to 6, 6 to 8, 8 to 10, 10 to 12, and 12 to 14 
Testing time: 60-75 min. 


At all levels the test yields three scores: (1) verbal, (2) quantitative, 
and (3) total. Although the total score provides most of the informa- 
tion obtainable from the test, if students score very differently on the 
verbal and quantitative portions, it indicates areas of unevenness in 
educational development. Generally the test is well constructed and 
standardized. The manual provides clear instructions for administering 
and interpreting the test. The major fault that some may find with the 
test is that it strongly emphasizes school-learned material rather than 
more abstract aspects of intelligence. 


Henmon-Nelson Tests of Mental Ability, Revised Edition 
Houghton Mifflin Company 

Levels: grades 3 to 6, 6 to 9, 9 to 12, and adult 

Testing time: 30-50 min. 


The test provides only one total score. Although it is a short test, it 
correlates well with longer tests of general intelligence and with 
achievement tests. The test is almost entirely concerned with verbal 
ability (as most tests of general intelligence are). Good norms are 
available for the test. Test reliabilities are high. The test provides a 
reasonably good, quick estimate of scholastic aptitude. 


Kuhlman-Anderson Intelligence Tests, Sixth Edition 
Personnel Press 


Levels: kindergarten, grades 1, 2, 3, 4, 5, and 6
Testing time: 30-45 min. 


A single IQ is obtained from numerous separate subtests. Although 
the over-all IQ has satisfactory reliability, it would be unsafe to interpret
differences among subtest scores. Although little actual reading is
required, the test mainly measures verbal comprehension. This is one 
of the best tests of general intelligence available for young children. 


Kuhlman-Anderson Intelligence Tests, Seventh Edition 
Personnel Press 

Levels: grades 7 to 9, and 9 to 12 

Testing time: 35-40 min.


This test is an extension of that used at earlier grade levels. It bears
many of the characteristics of the test for earlier levels, including a
strong emphasis on verbal comprehension. Scores are given for (1)
verbal, (2) quantitative, and (3) total. The verbal and quantitative
portions correlate so highly that little information can be obtained from
comparing scores on the two parts. As is true of the test for younger
students, it provides a good, rapid estimate of scholastic aptitude.
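
The caution about comparing part scores can be illustrated with the classical formula for the reliability of a difference between two scores that are on the same standard-score scale: r_diff = ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy). The sketch below is only an illustration. The formula is a standard textbook result and is not quoted from this listing, and the reliabilities and correlations used are hypothetical values, not figures from the test manual.

```python
def difference_score_reliability(r_xx: float, r_yy: float, r_xy: float) -> float:
    """Reliability of the difference between two standardized part scores.

    r_xx, r_yy: reliabilities of the two parts (e.g., verbal and quantitative).
    r_xy: correlation between the two parts.
    """
    if not -1.0 < r_xy < 1.0:
        raise ValueError("r_xy must lie strictly between -1 and 1")
    return ((r_xx + r_yy) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical values: each part is assumed to have a reliability of .90.
low_overlap = difference_score_reliability(0.90, 0.90, 0.50)
high_overlap = difference_score_reliability(0.90, 0.90, 0.85)
print(round(low_overlap, 2), round(high_overlap, 2))
```

Under these assumed values the reliability of the difference falls from about .80 to about .33 as the correlation between the parts rises from .50 to .85, which is why differences between highly correlated part scores carry little dependable information.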


Otis Quick-scoring Mental Ability Tests 
Harcourt, Brace, & World, Inc. 
Levels: grades 1.5 to 4 
grades 4 to 9 
grades 9 to 16 
Testing time: 20-30 min. 


In spite of the brevity of these tests, they provide useful estimates of
scholastic aptitude. The tests are almost entirely concerned with verbal
comprehension. They are very easy to administer and score. The test
for grades 1.5 to 4 requires no reading.


Pintner General Ability Tests, Nonlanguage Series 
Harcourt, Brace, & World, Inc. 

Levels: grades 4 to 9 

Testing time: 50-60 min. 


Like other nonlanguage tests, this one requires no reading and no
spoken or written language. The test is primarily useful for children
with a severe language handicap.


Pintner General Ability Tests, Verbal Series

Harcourt, Brace, & World, Inc. 

Levels: Primary, kindergarten to grade 2 
Elementary, grades 2.5 to 4.5 
Intermediate, grades 4.5 to 9.5 
Advanced, grades 9.0 and above 


Testing time: 45-55 min. 




At all levels this test bears many points in common with other group 
measures of general intelligence, including a very heavy emphasis on 
verbal ability. Generally the test is well constructed and standardized. 


SRA Tests of Educational Ability 
Science Research Associates 
Testing time: 30-55 min. 


The test attempts to measure three different aspects of scholastic apti- 
tude: (1) verbal, (2) reasoning, and (3) quantitative. However, not 
enough items are included to provide reliable measures of these three 
aspects, and it is much better to interpret only a total score for the 
three subtests. As an over-all measure of general intelligence, the test 
should take its place with other good measures of that kind. 


Terman-McNemar Test of Mental Ability 
Harcourt, Brace, & World, Inc. 

Levels: grades 7 to 12 

Testing time: 40-45 min. 


This test bears many of the characteristics of other verbal tests of in- 
telligence. It is almost entirely concerned with verbal comprehension. 
The test was very carefully designed and constructed. 


Section 4: Interest Inventories 


Brainard Occupational Preference Inventory 
Psychological Corporation 

Levels: grades 8 to 12, adult 

Testing time: 30 min. 


Covered in the inventory are six broad occupational fields: (1) com- 
mercial, (2) mechanical, (3) professional, (4) esthetic, (5) scientific, 
and (6) personal service (for girls) or agriculture (for boys). The in- 
ventory is simple to administer and score. At the present time, no evi- 
dence on the validity is available. 


Kuder Preference Record—Occupational 
Science Research Associates 

Levels: grades 9 to 16, adult 

Testing time: 25-30 min. 


This is one of three interest inventories by Kuder. All three share the 
same type of test item and test format. The purpose of this inventory is 
to provide scores for thirty-eight specific occupations. The average re- 
liability of the occupational scores is only about .60, which is too low 
for use in counseling students. The instrument needs to be further
standardized and validated before it is acceptable in high school
counseling programs. 
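
To see why a reliability of about .60 is regarded as too low for counseling individual students, it can help to compute the standard error of measurement, SEM = SD * sqrt(1 - r). The sketch below is illustrative only: the formula is the standard classical one, and the standard deviation of 10 score points is an assumed value, not a figure taken from the inventory's manual.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("sd must be non-negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Assumed standard deviation of 10 points, for illustration only.
sem_low = standard_error_of_measurement(10.0, 0.60)   # reliability cited above
sem_high = standard_error_of_measurement(10.0, 0.90)  # a more dependable scale
print(round(sem_low, 2), round(sem_high, 2))
```

Under these assumptions the error attached to a single obtained score at a reliability of .60 is about twice what it would be at .90, so an individual's occupational scores could shift considerably on retesting.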


Kuder Preference Record—Personal 
Science Research Associates 
Levels: grades 9 to 16, adult 
Testing time: 40-45 min. 


The purpose of this interest inventory is to measure five personal char- 
acteristics that potentially are important for occupational choice: (1) 
being an active participant in group activities, (2) being in familiar 
and stable situations, (3) dealing with abstract ideas, (4) avoiding 
conflict, and (5) leading and directing others. Potentially this inven- 
tory could serve as a supplement to the Occupational and Vocational 
forms, principally the latter. Neither enough research evidence nor
enough practical experience with the instrument is available to know how
helpful it will be in high school counseling.


Kuder Preference Record—Vocational 
Science Research Associates 
Levels: grade 9 and above 
Testing time: 30-50 min.


See discussion in Chapter 15. For many years this has been one of the
most widely used interest inventories in high school counseling programs.
In contrast to the Occupational form, this form provides scores
in ten broad vocational areas rather than for many separate occupations.
The instrument is well designed and standardized. The ten
scales each have moderately high reliability. Although not enough evidence
is available about validity, the instrument is judged to be useful
by many high school counselors.


Strong Vocational Interest Blank for Men, Revised
Stanford University Press
Levels: age 17 and over 
Testing time: about 40 min. 


See discussion in Chapter 15. This is by far the most widely used in- 
terest inventory. It provides scores for numerous separate occupations 
as well as a number of global interest scores. Considerable research 
has been done with the inventory. Results show that most of the occupational
scores are sufficiently reliable, that scores are predictive of
occupations that students enter some years after taking the test, and
that some scales differentiate between successful and unsuccessful
people in occupations. The instrument is very useful in counseling
high school students about future schooling and careers.




Strong Vocational Interest Blank for Women, Revised 
Stanford University Press 

Levels: 17 years and older 

Testing time: about 40 min. 


This inventory for females is very similar to the form for men. It pro- 
vides scores on twenty-four occupations. Not nearly as much research 
has been done with this as has been done with the form for men. Be- 
cause of the similar methods of constructing both instruments, it is ex- 
pected that the female form will prove to be useful in high school 
counseling. 


Section 5: Personality and Adjustment Inventories 


Bell Adjustment Inventory 
Stanford University Press 
Levels: grades 9 to 16, adult 
Testing time: 25 min. 


The student form provides scores in four areas of adjustment: (1) 
home, (2) health, (3) social, and (4) emotional. This inventory suffers 
from all the difficulties that others do. It is dependent almost entirely 
on the individual's awareness of personal problems and his willingness
to relate them. This inventory is primarily useful for the rough screening
of students who may need help with personal problems. It has been
used for many years in conjunction with high school counseling
programs.


California Test of Personality, 1953 Revision 
California Test Bureau 
Levels: kindergarten to grade 3 
grades 4 to 8 
grades 7 to 10 
grades 9 to 16 
adult 
Testing time: 45-60 min. 


This is one of the few inventories that attempt to measure personality
characteristics of young children. Scores are provided with respect to
twelve personality characteristics, e.g., school relations, sense of personal
worth, and withdrawing tendencies. Also available are a total
adjustment score and two subtotal scores relating, respectively, to social
and personal adjustment. Scores on the individual scales are far
too unreliable for use. Consequently, only the total adjustment score
and the two subtotal scores should be used. Although there are some
legitimate uses of personality inventories, there are so many problems
in developing valid measures, and so much still is unknown about the
meaning of responses to personality inventories with young children,
that this and other inventories should be used with extreme caution.
Inventories for young children should be administered and interpreted
only by well-trained counselors.


Gordon Personal Profile 
Harcourt, Brace, & World, Inc. 
Levels: grades 9 to 16, adult 
Testing time: about 20 min. 


The inventory provides five scores: (1) ascendancy, (2) responsibility,
(3) emotional stability, (4) sociability, and (5) total adjustment
score. Not enough research evidence and practical experience are
available to say how useful the inventory will be in counseling high
school students.


Guilford-Zimmerman Temperament Survey 
Sheridan Supply Company 

Levels: grades 9 to 16 and adult 

Testing time: 50 min. 


This inventory arose from extensive factor-analytic investigations of
items typically appearing on personality inventories. Scores are provided
for ten personality factors. Although generally this is a well-constructed
instrument, it may provide more information than can be
interpreted by counselors. Research evidence still is lacking regarding
its validity. This inventory may be useful in the hands of highly
trained and experienced counselors but probably would prove difficult
for less well trained and experienced counselors to use.


Heston Personal Adjustment Inventory 
Harcourt, Brace, & World, Inc. 
Testing time: 40-50 min. 


The inventory provides six scores: (1) analytical thinking, (2) sociability,
(3) home adjustment, (4) emotional stability, (5) confidence, and
(6) personal relations. It apparently was well constructed and standardized.


Minnesota Multiphasic Personality Inventory (MMPI)
Psychological Corporation 
Levels: 16 and over 
Testing time: about 60 min. 




The purpose of the inventory is to detect tendencies toward nine differ- 
ent forms of mental illness. The instrument is widely used by clinical 
psychologists and school psychologists. It requires considerable pro- 
fessional training and experience to interpret the results. Consequently, 
it is not wise to employ this inventory unless highly trained personnel 
are available. 


Mooney Problem Checklist 

Psychological Corporation 

Levels: grades 7 to 9, 9 to 12, 13 to 16, and adult 
Testing time: about 30 min. 


This is an old and very sensible instrument. Rather than purporting to 
be a test, in the formal sense of the word, it is intended to be used as a 
screening device for students with personal problems and as an aid to 
counseling. The inventory consists of a long list of problems typical of
those that bother some students. The items were obtained from written
statements of problems by over 4,000 students and from other sources.
The problems concern social relations, home adjustment, health, financial
difficulties, sexual problems, religious difficulties, and others. In
using the checklist, it is as important to study the particular kinds of
problems indicated as it is to observe the number of problems checked
by each student. Wisely used, this checklist (and the two to be described
next) has an important place in school counseling programs.


SRA Junior Inventory 
Science Research Associates 
Levels: grades 4 to 8 
Testing time: about 45 min. 


See Chapter 16 for a discussion of this instrument. It is similar in item 
content to the Mooney Problem Checklist. Problems are sampled from 
five areas: (1) “about myself,” (2) “about me and my school,” (3) 
“about me and my home,” (4) “getting along with other people,” and 
(5) “things in general.” The good things said about the Mooney in- 
ventory also apply to this one. It probably is the best adjustment in- 
ventory available for students in elementary school. 


SRA Youth Inventory 
Science Research Associates 
Levels: grades 7 to 12 
Testing time: about 35 min. 


This is an extension of the Junior inventory for older students. What 
was said about the Junior inventory holds with equal force in regard 
to the Youth form. 


References


1. Allport, G. W. and Allport, F. H. The A-S Reaction Study: Revised Manual.
Boston: Houghton Mifflin, 1939.

2. Andrew, D. M. and Paterson, D. G. Minnesota Clerical Test: Manual. New
York: Psychological Corporation, 1946.

3. Bayley, Nancy. The California First-year Mental Scale. Univer. Calif. Syllabus
Ser., 1933, No. 243.

4. Beck, S. J. Rorschach's test. New York: Grune & Stratton, 1945, 1949, 1952,
3 Vols.

5. Beckman, A. S. Minimum intelligence levels for several occupations. Personnel
J., 1930, 9, 309-313.

6. Bellak, L. and Bellak, Sonya S. Children's Apperception Test. New York:
C. P. S. Company, 1955.

7. Bennett, G. K. Hand-tool Dexterity Test: Manual. New York: Psychological
Corporation, 1947.

8. Bennett, G. K. Test of Mechanical Comprehension, Form BB: Manual. New
York: Psychological Corporation, 1951.

9. Bennett, G. K., Seashore, H. G., and Wesman, A. G. Differential Aptitude
Tests: Manual. New York: Psychological Corporation, 1947, 1952, 1959.

10. Binet, A. and Simon, T. Méthodes nouvelles pour le diagnostic du niveau
intellectuel des anormaux. Année Psychol., 1905, 11, 191-244.

11. Bisbee, E. V. Commercial Education Survey Tests: Junior and Senior Shorthand.
Bloomington, Ill.: Public School, 1933.

12. Bogardus, E. A social distance scale. Sociol. and Soc. Res., 1933, 17, 265-271.

13. Bond, E. A. Tenth grade abilities and achievements. Teach. Coll. Contr. Educ.,
1940, No. 813.

14. Buros, O. K. Tests in print. Highland Park, N.J.: Gryphon Press, 1961.

15. Buros, O. K. The fifth mental measurements yearbook. Highland Park, N.J.:
Gryphon Press, 1959. (First yearbook, 1938; second, 1940; third, 1949;
fourth, 1953).

16. Buswell, G. T. and John, Lenore. Diagnostic Chart for Fundamental Processes
in Arithmetic: Manual of Directions. Indianapolis: Public School, 1925.

17. Cattell, Psyche. The measurement of intelligence of infants and young children.
New York: Psychological Corporation, 1947.

18. Conrad, H. S. and Jones, H. E. A second study of familial resemblance in
intelligence: Environmental and genetic implications of parent-child and
sibling correlations in the total sample. 39th Yearb., Natl. Soc. Stud. Educ.,
1940, Part 2, 97-141.

19. Cooperative Test Division, Educational Testing Service. Cooperative General
Achievement Tests. Princeton, N.J.: Educ. Testing Serv., 1956.

20. Crawford, J. E. and Crawford, D. M. Small Parts Dexterity Test: Manual.
New York: Psychological Corporation, 1949.

21. Davis, H. (Ed.) Hearing and deafness. New York: Rinehart, 1947.

22. Drake, R. M. Musical Memory Test: Manual. Bloomington, Ill.: Public School,
1934. (Also distributed by Science Research.)




23. Durost, W. N. (Ed.) Metropolitan Achievement Tests. Tarrytown, N.Y.:
Harcourt, Brace & World, 1960.

24. Edwards, A. L. The relationship between the judged desirability of a trait
and the probability that the trait will be endorsed. J. appl. Psychol., 1953,
37, 90-93.

25. Farnsworth, D. The Farnsworth Dichotomous Test for Color Blindness:
Manual. New York: Psychological Corporation, 1947.

26. Farnsworth, D. The Farnsworth-Munsell 100 Hue Test for the Examination of
Color Discrimination: Manual. Baltimore: Munsell Color Company, 1949.

27. Farnsworth, P. R. Rating scales for musical interests. J. Psychol., 1949, 28,
245-253.

28. Freeman, E. and Zaccaria, M. A. An illuminant-stable color-vision test, II.
J. Opt. Soc. Am., 1948, 38, 971-976.

29. Gesell, A. et al. Gesell Developmental Schedules. New York: Psychological
Corporation, 1949.

30. Ghiselli, E. E. The validity of commonly employed occupational tests. Univer.
Calif. Publ. Psychol., 1949, 5, 253-287.

31. Ghiselli, E. E. and Brown, C. W. The effectiveness of intelligence tests in the
selection of workers. J. appl. Psychol., 1948, 32, 575-580.

32. Gilliland, A. R. Northwestern Intelligence Tests. Test A, for infants 4-12
weeks old. Boston: Houghton Mifflin, 1949.

33. Goodenough, Florence and Van Wagenen, M. J. Minnesota Preschool Scale.
Form A and F. Minneapolis: Educ. Test Bur., 1940.

34. Gough, H. G. Minnesota Multiphasic Personality Inventory. In A. Weider
(Ed.), Contributions toward medical psychology. New York: Ronald, 1953,
Vol. 2.

35. Graves, M. Design Judgment Test: Manual. New York: Psychological Corporation,
1948.

36. Gray, W. S. Standardized Oral Reading Paragraphs. Indianapolis: Public
School, 1915.

37. Greene, H. A., Jorgensen, A. N., and Kelley, V. H. Iowa Silent Reading Tests.
Tarrytown, N.Y.: Harcourt, Brace & World, 1956.

38. Guilford, J. P. Psychometric methods. (2nd ed.) New York: McGraw-Hill,
1954.

39. Guilford, J. P. and Zimmerman, W. S. The Guilford-Zimmerman Temperament
Survey: Manual. Beverly Hills, Calif.: Sheridan Supply Company, 1955.

40. Hartshorne, H., May, M. A., and Shuttleworth, F. K. Studies in the organization
of character. New York: Macmillan, 1930.

41. Hathaway, S. R. and McKinley, J. C. Minnesota Multiphasic Personality Inventory.
(Rev. ed.) New York: Psychological Corporation, 1951.

42. Holzman, W. H. Objective scoring of projective tests. In B. M. Bass and I.
A. Berg (Eds.), Objective approaches to personality assessment. Princeton,
N.J.: Van Nostrand, 1959.

43. Honzik, M., McFarlane, J., and Allen, L. The stability of mental test performance
between two and eighteen years. J. exp. Educ., 1948, 17, 309-324.

44. Horn, C. C. Horn Art Aptitude Inventory: Manual. Chicago: Stoelting, 1953.

45. Horn, C. C. and Smith, L. F. The Horn Art Aptitude Inventory. J. appl.
Psychol., 1945, 29, 350-355.

46. Klineberg, O. Negro intelligence and selective migration. New York: Columbia,
1935.

47. Knauber, Alma J. Knauber Art Ability Test: Examiner's Manual. Cincinnati,
Ohio: Author, 1935. (Distributed by Psychological Corporation.)


48. Kuder, G. F. Kuder Preference Record - Vocational. Chicago: Science Research,
1956.

49. Kwalwasser, J. and Dykema, P. W. Kwalwasser-Dykema Music Tests: Manual
of Directions. New York: Carl Fischer, Inc., 1930. (Also distributed by
Stoelting.)

50. Lewerenz, A. S. Tests in Fundamental Abilities of Visual Art: Manual of
Directions. Los Angeles: Calif. Test Bureau, 1927.

51. Likert, R. and Quasha, W. H. Revised Minnesota Paper Form Board Test:
Manual. New York: Psychological Corporation, 1948.

52. Lorge, I. and Thorndike, R. L. The Lorge-Thorndike Intelligence Tests.
Boston: Houghton Mifflin, 1957.

53. McAdory, M. The McAdory Art Test: Manual. New York: Teachers College,
1929.

54. Maxwell, W. C. International Typewriting Tests. Minneapolis: Educ. Test
Bur., 1950.

55. Meier, N. C. The Meier Art Tests: I. Art Judgment, Manual. Iowa City, Iowa:
Univer. Iowa, Bur. Educ. Res. Service, 1942.

56. Melton, A. W. (Ed.) Apparatus Tests. AAF Aviation Psychology Program
Research Reports, Rep. No. 4, Washington: GPO, 1947.

57. Murray, H. A. Thematic Apperception Test. Cambridge, Mass.: Harvard, 1943.

58. Nunnally, J. C. Popular conceptions of mental health: Their development and
change. New York: Holt, 1959.

59. Nunnally, J. C. and Flaugher, R. L. Correlates of semantic habits. J. Personality,
1963, 31, 192-202.

60. Osgood, C. E., Suci, G. J., and Tannenbaum, P. The measurement of meaning.
Urbana, Ill.: Univer. Illinois Press, 1957.

61. Pintner, R., Cunningham, Bess, and Durost, W. Pintner-Cunningham Primary
Test: Manual of Directions. Yonkers, N.Y.: World, 1946.

62. Psychological Corporation. General Clerical Test: Manual. New York: Psychological
Corporation, 1950.

63. Remmers, H. H. and Bauernfeind, R. H. SRA Junior Inventory. Chicago:
Science Research, 1957.

64. Remmers, H. H. and Shimberg, B. SRA Youth Inventory. Chicago: Science
Research, 1960.

65. Richardson, Bellows, Henry, and Company, Inc. SRA Mechanical Aptitudes:
Manual. Chicago: Science Research, 1950.

66. Roberts, J. A. F. Resemblances in intelligence between sibs selected from a
complete sample of an urban population. Proc. Intern. Genet. Congr., 1941,
7, 252.

67. Seashore, C. E., Lewis, D., and Saetveit, J. G. Seashore Measures of Musical
Talents. (Rev. ed.) New York: Psychological Corporation, 1960.

68. Seashore, R. H. and Hevner, K. A. A time-saving device for the construction
of attitude scales. J. soc. Psychol., 1933, 4, 366-372.

69. Stewart, N. A.G.C.T. scores of army personnel grouped by occupations.
Occupations, 1947, 26, 5-41.

70. Stromberg, E. L. Stromberg Dexterity Test: Preliminary Manual. New York:
Psychological Corporation, 1951.

71. Strong, E. K. Vocational interests 18 years after college. Minneapolis: Univer.
Minnesota Press, 1955.

72. Strong, E. K. Strong Vocational Interest Blank for Men, Revised. Stanford,
Calif.: Stanford, [year illegible].




73. Studies in visual acuity. PRS Report 742, Personnel Res. Sect., AGO, 1948,
161.

74. Stutsman, R. Mental measurement of preschool children. Yonkers, N.Y.: World,
1931.

75. Symonds, P. M. Symonds Picture-story Test. New York: Teachers College,
1948.

76. Taylor, C. W. (Ed.) Research conference on the identification of creative
scientific talent. Ogden, Utah: Univer. Utah Press, 1961. (See presentation
by J. P. Guilford.)

77. Terman, L. M. and Merrill, Maud. Stanford-Binet Intelligence Scale. Boston:
Houghton Mifflin, 1960.

78. Tiegs, E. W. and Clark, W. W. California Achievement Tests. Monterey,
Calif.: Calif. Test Bur., 1950 and 1957.

79. Thurstone, L. L. A factorial study of perception. Psychometr. Monogr., 1944,
No. 4.

80. Valentine, C. W. Intelligence tests for young children. London: Methuen,
1945.

81. Watson, L. A. and Tolan, T. Hearing tests and hearing instruments. Baltimore:
Williams and Wilkins, 1949.

82. Wechsler, D. The measurement of adult intelligence. (3rd ed.) Baltimore:
Williams and Wilkins, 1944.

83. Wechsler, D. Wechsler Intelligence Scale for Children: Manual. New York:
Psychological Corporation, 1949.

84. Wechsler, D. Manual for the Wechsler Adult Intelligence Scale. New York:
Psychological Corporation, 1955.

85. Wing, H. D. Wing Standardized Tests of Musical Intelligence. (Rev. ed.)
London: National Foundation for Educational Research in England and
Wales, 1960.

86. Woodworth, R. S. Personal Data Sheet. Chicago: Stoelting, 1918.


Index 


Absolute responses, methods for 
measuring, 331-332 
Acceptability influence on tests, 
352-353 
Achievement and intelligence, 56- 
57, 171-173, 245-246 
Achievement tests, administration, 
195-197 
comprehensive batteries, 186-198 
construction of, 173-177 
content areas on, 193-195 
diagnostic, 203-210 
kinds of, 166-168 
outline of content, 173-176 
special topics, measures of, 199- 
210 
use of, 177-184 
in guidance and counseling, 
181 
in research, 183-184 
schedules for, 382-383 
Administration of tests, achieve- 
ment, 195-197 
diagnostic, 210 
Age norms, 51-52 
Allen, L., 270, 430 
Allport, F. H., 23, 429 
Allport, G. W., 23, 429 
Alternate-form reliability, 83-84 
American Educational Research 
Association, 185 
American Optical Co., 280 
Anastasi, Anne, 28, 58, 88, 185, 
303, 375 
Andrew, D. M., 293, 429 
Answer sheets for objective tests,
141-142
Arithmetic, testing for, 191-193 
Army General Classification Test,
271, 431


Art, aptitudes for, 294-304 

A-S Reaction Study, 429 

Assessment function of tests, 19-22 

Association as measure of person- 
ality, 359-360 

Attenuation of correlations, correc- 
tion for, 399-401 

effect of, 74-77 

Attitudes, measurement of, 334-342 

Audiogram, 281-282 

Audition, measurement of acuity in, 
280-282 

Average, measures of, 30-35 


Bass, B. M., 430 

Bauernfeind, R. H., 431 

Bausch and Lomb Co., 280 

Bayley, Nancy, 265, 429 

Bean, K. L., 185 

Beck, S. J., 355, 429

Beckman, A. S., 285, 429 

Behavioral tests, 368-370 

Bell Adjustment Inventory, 426 

Bellak, L., 359, 429 

Bellak, Sonya S., 359, 429 

Bennett, G. K., 233, 237, 238, 287,
291, 429

Bennett Mechanical Comprehension 
Test, 291, 429 

Berg, I. A., 430 

Best-fit line in correlation, 61, 
401-403 

Bias in scores from measurement 
error, 70-73 

Binet, A., 216-217, 250, 429 

Bisbee, E. V., 294, 429 

Blommers, P., 58, 88 

Bloom, B. S., 107 

Bogardus, E. A., 338, 429 




Bond, E. A., 429 
Brainard Occupational Preference 
Inventory, 424 
Brown, C. W., 272, 430 
Bureau of Publications, Columbia 
University, 420 
(See also Publishers of tests) 
Buros, O. K., 387, 388, 389, 429 
Buswell, G. T., 205, 429 


California Achievement Tests, 174- 
176, 416, 432 

California First-Year Mental Scale, 
265, 429 

California Test Bureau, 416, 426 

(See also Publishers of tests) 

California Test of Personality, 426- 
427 

Cattell Infant Intelligence Test, 265 

Cattell, Psyche, 265, 429 

Character in relation to personality,
344-345

Checklists of personal problems, 
347-348 

Chicago Nonverbal Examination, 
421 

Children’s Apperception Test, 359, 
429 


Clark, W. W., 432 

Clerical aptitude, 293-294 

Clerical speed, 234 

Color vision, 280 

Combining scores on different tests, 
158-164 

Commercial Education Survey Tests, 
429 

Complex Coordination Test, 288 

Confidence bands, for correlations, 
67 

for obtained scores, 70-73 

Conrad, H. S., 269, 429 

Construct validity, 22-24 

Content areas on achievement tests, 
193-195 

Content sampling and reliability, 79 

Cooperative English Test, 421 


Cooperative General Achievement 
Tests, 429 
Cooperative School and College 
Ability Tests, 422 
Correlation, coefficient of, 60-68, 
401-403 
Counseling and guidance, personnel, 
384-387 
use of, achievement tests in, 181 
intelligence tests in, 273-274 
Crawford, D. M., 286, 429 
Crawford, J. E., 286, 429 
Creativity, classroom exercises con- 
cerning, 318 
measurement of, 309-317 
nature of, 304-305 
traits relating to, 305-309 
Cronbach, L. J., 13, 28, 240, 303, 
375 
Cunningham, Bess, 262, 431 
Cureton, E. E., 28 


Dark vision, 373-374 

Davis, H., 281, 429 

Deduction factor, 223-224 

Deviation scores, 35 

Diagnostic achievement tests, 203- 
210 

Differential Aptitude Tests, 233- 
239, 291, 293, 429 

Difficulty of items, 132-134 

Discrimination as characteristic of 
test items, 135-137 

Dispersion, measures of, 35-37 

Drake Musical Memory Test, 297, 
429 

Drake, R. M., 297, 429 

Durost, W. N., 262, 430, 431 

Durrell-Sullivan Reading Capacity 
and Achievement Tests, 419- 
420 

Dykema, P. W., 431 


Easiness percentage, 132-134 
Ebel, R. L., 138 

Educational age, 52 
Educational Index, 390 


Educational Test Bureau, 267 
(See also Publishers of tests) 
Educational Testing Service, 418, 
421, 422, 429 
(See also Publishers of tests) 
Edwards, A. L., 353, 430 
Embedded figures, 228, 316, 373 
Essay items, rules for writing, 128- 
132 
Essential High School Content Bat- 
tery, 416-417 
Evaluation of performance, 148-165 


Factor analysis, 217-219 

Factors of intellect, 215-239 

Faculty psychology, 216 

Farnsworth, D., 280, 430 

Farnsworth, P. R., 298, 430 

Farnsworth Dichotomous Test for 
Color Blindness, 280, 430 

Farnsworth-Munsell 100 Hue Test, 
280, 430 

Figural relations, items relating to, 
243 

Fill-in items, 116-118 

Flanagan, J. C., 58

Flaugher, R. L., 431 

Freeman, E., 280, 430 

French, W., 107 

Frequency distribution, 38-46 

Fruchter, B., 240 

Fundamental Processes in Arith- 
metic, 205-209 


Garrett, H. E., 58, 88 

Gates Basic Reading Tests, 420 

Gates Primary Reading Tests, 420 

General ability, measures of, 241- 
277 

General Clerical Test, 293, 431

General reasoning factor, 223 

Geography, outline of objectives for, 
99-101 

Gerberich, J., 139 

Gesell, A., 430 

Gesell Developmental Schedules, 
264-266, 430 




Ghiselli, E. E., 272, 284, 430 
Gilliland, A. R., 265, 430 
Goodenough, Florence, 267, 430 
Gordon Personal Profile, 427 
Gough, H. G., 349, 430 
Grade norms, 53 
Grade placement, 177-179 
Grading of tests, 143-165 
Graves, M., 300, 430 
Graves Design Judgment Test, 300, 
430 
Gray, W. S., 204, 430 
Gray’s Oral Reading Paragraphs, 
204-205, 430 
Greene, H. A., 430 
Group tests of intelligence compared 
to individual tests, 246-248 
Guessing, corrections for, 142-143 
effect of, on objective items, 114- 
116, 118-121 
on reliability, 79-80 
instructions regarding, 142-143 
Guessing-who technique, 367-368 
Guidance and counseling, personnel, 
384-386 
use of, achievement tests in, 181 
intelligence tests in, 273-274 
Guilford, J. P., 88, 240, 350, 375, 
404, 430 
Guilford-Zimmerman Temperament 
Survey, 350-351, 427, 430 


Hagen, Elizabeth, 13, 107, 139, 165, 
303, 375, 393 

Hand-tool Dexterity Test, 287, 429 

Harcourt, Brace, & World, 263, 416, 
417, 419-421, 423, 424, 427 

(See also Publishers of tests) 

Hartshorne, H., 369, 430 

Hathaway, S. R., 348, 430 

Henmon-Nelson Tests of Mental 
Ability, 422 

Heredity, relation of, to intelligence, 
268-269 

Heston Personal Adjustment Inven- 
tory, 427 




Hevner, K. A., 298, 431 

Hill, G. E., 393 

Holzman, W. H., 430 

Honzik, M., 270, 430 

Horn, C. C., 300, 301, 430 

Horn Art Aptitude Inventory, 300- 
301, 430 

Houghton Mifflin, 255, 264, 417, 
421, 422 

(See also Publishers of tests) 
Hours for testing in schools, 103-104


Illuminant-Stable Color Vision Test, 
280, 430 
Indians, scores of, on intelligence 
tests, 269-270 
Individual tests of intelligence com- 
pared to group tests, 246-248 
Infants, tests for, 264-266 
Instability of scores over time, 77- 
78, 81-82, 87 
Instructions for tests, methods for 
improving, 105-106 
Intellect, factors of, 215-239 
Intelligence Test for Young Children, 
268, 432 
Intelligence tests, 241-277, 380-381 
Interests, measurement of, 325-334 
Internal-consistency reliability, 86,
403-404
International Typewriting Tests, 431 
Iowa Silent Reading Tests, 420, 430 
Iowa Tests of Basic Skills, 417 
Iowa Tests of Educational Develop- 
ment, 417 
Item analysis, 132-138
Item writing, objective vs. essay,
109-114
types of, 114-121
rules for, 114-132
skill at, 108-109


John, Lenore, 205, 429 
Jones, H. E., 269, 429 
Jorgensen, A. N., 430 


Kearney, N. C., 107 

Kelley, V. H., 430 

Kelley-Green Reading Comprehension Test, 421

Kent Area Guidance Council, 393 

Keystone View Company, 280 

Klineberg, O., 269, 430 

Knauber, Alma J., 430 

Knauber Art Ability Test, 430 

Kuder, G. F., 328, 431 

Kuder Preference Record, 327-328, 
424-425, 431 

Kuder-Richardson formula, 403-404 

Kuhlman-Anderson Intelligence 
Tests, 422-423 

Kwalwasser, J., 431 

Kwalwasser-Dykema Music Tests,
431 


Language skills, measurement of, 
190-191 

Lewerenz, A. S., 431 

Lewis, D., 295, 431 

Likert, R., 290, 431 

Lindquist, E. F., 58, 88 

Lorge, I., 13, 263, 431 

Lorge-Thorndike Intelligence Tests, 
263-264, 388-389, 431 


McAdory, M., 298, 431

McAdory Art Test, 298, 431 

McFarlane, J., 270, 430 

McKinley, J. C., 348, 430 

Matching items, 118-119 

Mathematics, measurement of, 191- 
193, 201-202 

Maxwell, W. C., 294, 431 

May, M. A., 369, 430 

Mean, measure of average perform- 
ance, 34 

Meaningful memory factor, 225 

Measurement error, 68-87, 404-407

Mechanical aptitude, 283-293 

Median, measure of average per- 
formance, 32-34, 407-408 

Meier, N. C., 298, 431 


Meier Art Judgment Test, 298-300, 
431 

Melton, A. W., 287, 431 

Memory factors, 224-225 

Mental age, 53 

Mental illness, 345, 348-350 

Mental Measurements Yearbooks, 
387-389, 429 

Merrill, Maud, 251, 252, 432 

Merrill Palmer Scale, 268 

Metropolitan Achievement Tests, 
176, 186-187, 417-418, 430 

Milholland, J. E., 388 

Minnesota Clerical Test, 293, 429 

Minnesota Multiphasic Personality 
Inventory, 348-350, 427-428, 
430 

Minnesota Paper Form Board Test, 
290, 431 

Minnesota Preschool Scale, 267, 430 

Misleads as alternative answers, 
134-135 

Mode, measure of average perform- 
ance, 32 

Mooney Problem Checklist, 428 

Motor dexterity, 285-289 

Multifactor test batteries, 233-239 

Multiple-choice items, rules for 
writing, 114-128 

Murray, H. A., 358, 431 

Music, aptitude for, 295-298 


National Council on Measurements 
Used in Education, 185 

Negroes, scores of, on intelligence 
tests, 269-270 

Nelson-Denny Reading Test, 421 

Noll, V. H., 185 

Normal distribution, 40-46, 396- 
398 

Norms, 49-57 

Northwestern Intelligence Tests, 
265, 430 

Number skills, outline of objectives 
for, 96-99 

Numerical computation factor, 222- 
223 


Nunnally, J. C., 13, 88, 240, 337- 
338, 372, 431 


Objective items, rules for writing, 
114-128 
Objectives, outline of, 96-103 
Observation, use of, in measuring 
attitudes, 335 
in measuring personality, 364- 
367 
Obtained scores, with regard to 
measurement error, 69-73 
Ordering of alternative answers, 
124-125 
Ortho-Rater, 280 
Osgood, C. E., 337, 431
Otis Quick-scoring Mental Ability 
Tests, 423 
Outline of test content, 96-103 


Pair comparisons, use of, in attitude 
measurement, 334 
Paterson, D. G., 293, 429 
Peer ratings, 367-368 
Percentiles, 47-49 
Perceptual closure factor, 227-228 
Perceptual factors, 227-
Perceptual speed factor, 227 
Performance tests, 248-250 
Personal Data Sheet, Woodworth’s, 
346-347, 432 
Personality, measurement of, 344-
375
use of observation in, 364-367 
use of self-report in, 345-353 
use of sentence completion in, 
360-361 
ratings of, 365-367 
relation of character to, 344-345 
relations of physiology to, 370- 
371 
Personnel Press, 422, 423 
(See also Publishers of tests)
Physics, outline of objectives for, 
101-103 


Physiology, relations to personality, 
370-371 
Pintner, R., 262, 431 
Pintner-Cunningham Primary Test, 
262-263, 431 
Pintner General Ability Tests, 423- 
424 
Prediction functions of tests, 15-18 
Preschool children, tests for, 266- 
268 
Problem checklists, 347-348 
Product-moment coefficient, 62 
Product scales, 301, 314-315 
Projective techniques, in attitude 
measurement, 339-340 
in personality measurement, 353- 
364 
Promotion from grade to grade, 179 
Psychological Abstracts, 390 
Psychological Corporation, 233, 259, 
260, 266, 286, 290, 291, 293, 
300, 421, 424, 427, 428, 431 
(See also Publishers of tests) 
Psychological Corporation General 
Clerical Test, 431 
Publishers of tests, 389-390, 395 
Bureau of Publications, Columbia 
University, 420 
California Test Bureau, 416, 426 
Educational Test Bureau, 267 
Educational Testing Service, 418, 
421, 422, 429 
Harcourt, Brace, & World, Inc., 
263, 416, 417, 419-421, 423, 
424, 427 
Houghton Mifflin Company, 255, 
264, 417, 421, 422 
Personnel Press, 422, 423 
Psychological Corporation, 233, 
259, 260, 266, 286, 290, 291, 
293, 300, 421, 424, 427, 428, 
431
Science Research Associates, 228, 
316, 317, 417, 418, 424, 425, 
428 
Sheridan Supply Company, 427 
Stanford University Press, 425, 
426


Quasha, W. H., 290, 431 
Quotient scores, 53-54 


r, correlation coefficient, 61 

Range as measure of dispersion, 36 

Ranks as scores, 46-47 

Ratings of personality, 365-367 

Reading skills, measurement of, 189- 
190, 200-201 

Reasoning factors, 223-224 

Regression and measurement error, 
70-73, 404-407 

Reliability of tests, 68-87 

Remmers, H. H., 431 

Report cards, 157-158 

Response sets, 371-372 

Retest reliability, 84-85 

Richardson, Bellows, Henry, and 
Company, Inc., 431 

Roberts, J. A. F., 431 

Role playing of test administration, 
105-106 

Rorschach Test, 355-358, 429 

Ross, C. C., 18, 107, 139, 393 

Rote memory factor, 224-225 


Saetveit, J. G., 295, 431 
Sampling error, of correlations, 66- 
67, 408-410 
of mean, 54-55 
Scatter diagram, 63-64 
Schedules of testing in school-wide 
program, 379-384 
School psychologists, 384-385 
Science on achievement tests, 193- 
195, 202-203 
Science Research Associates, 228, 
316, 417, 418, 424, 425, 428 
(See also Publishers of tests) 
Score points, 31 
Scoring errors, 79 
Scoring procedures, on essay tests, 
143-148 
on objective tests, 140-143 
Scott, J. D., 393 
Seashore, C. E., 295, 431 


Seashore, H. G., 58, 233, 237, 238, 
429 
Seashore, R. H., 298, 431 
Seashore Measures of Musical 
Talents, 295, 431 
Seeing relationships, factor of, 224 
Self-report, use of, in measuring 
attitudes, 335-339 
in measuring personality, 345- 
353 
Semantic Differential, 336-338 
Sentence completion as measure of 
personality, 360-361 
Sequential Tests of Educational 
Progress, 191, 418 
Sheridan Supply Co., 427 
(See also Publishers of tests) 
Shimberg, B., 431 
Shuttleworth, F. K., 430 
Sight-Screener, 280 
Simon, T., 216-217, 250, 429 
Situational adjustment, 345 
Small Parts Dexterity Test, 286, 429 
Smith, L. F., 300, 430 
Snellen chart, 279-280 
Social-distance scale, 338-339, 429 
Social studies on achievement tests, 
193-195, 202-203 
Social traits, 345 
Sociogram, 367-368 
Spatial factors, 226 
Spatial orientation factor, 226 
Spatial visualization factor, 226 
Spearman-Brown formula, 410-412
Speed, effect of, on test results, 103- 
104, 228, 287-288 
Split-half reliability, 85-86 
SRA Achievement Series, 418-419 
SRA Junior Inventory, 347, 428, 431 
SRA Mechanical Aptitude Test, 291, 
431 
SRA Tests of Educational Ability, 
424 
SRA Youth Inventory, 347, 428, 431 
Stalnaker, J. M., 139 
Standard deviation as measure of
dispersion, 36-37, 412-413


Standard error of measurement, 70, 
404-407 

Standard scores, 44-46 

Standardization of tests, effect of, 
on reliability, 80-81 

general principles of, 6-8 

Standardized Oral Reading Para- 
graphs, 430 

Stanford Achievement Tests, 419 

Stanford-Binet Intelligence Scale, 
251-257, 432 

Stanford University Press, 425, 426 

(See also Publishers of tests) 

Stanley, J. C., 13, 107, 139, 393 

Stenographic ability, 294 

Stewart, N., 271, 431 

Stoelting Co., 280 

Strang, Ruth, 165 

Stromberg, E. L., 286, 431 

Stromberg Dexterity Test, 286, 431 

Strong, E. K., 327, 431 

Strong Vocational Interest Blank, 
326-327, 425-426, 431 

Study skills, measurement of, 193 

Stutsman, R., 268, 432 

Subdivided-test reliability, 85-86 

Suci, G. J., 337, 431 

Suggestibility, 373

Super, D. E., 303 

Symonds, P. M., 359, 432 

Symonds Picture-story Test, 359, 
432 


Tannenbaum, P., 337, 431 

Taylor, C. W., 304, 432 

Technical Recommendations for 
Achievement Tests, 27, 185 

Technical Recommendations for 
Psychological Tests and Diag- 
nostic Techniques, 28 

Telebinocular, 280 

Terman, L. M., 251, 252, 432 

Terman-McNemar Test of Mental 
Ability, 424 

Test length, effect of, on reliability, 
78 


Test of Mechanical Comprehension, 
429 

Tests, sources of information about, 
387-390 

Tests in Fundamental Abilities of 
Visual Art, 431 

Tests in Print, 389, 429 

Thematic Apperception Test, 358-
359, 431 

Thomas, R. M., 165 

Thorndike, R. L., 13, 88, 107, 139, 
165, 263, 303, 375, 393, 431 

Thurstone, L. L., 227, 432 

Tiegs, E. W., 432 

Timing tests, 103-104 

Tolan, T., 281, 432 

Trait measurements, 22-24

Transformation of score distributions, 
37-38, 415 

True-false items, 114-116 

True scores, 69-70 


Valentine, C. W., 268, 432 

Van Wagenen, M. J., 267, 430 

Verbal behavior as measure of per- 
sonality, 372-373 


Verbal comprehension factor, 221 

Verbal fluency factor, 221-222, 311- 
312 

Vision, measures of acuity, 279-282 


Wall charts, 279-282 

Watson, L. A., 281, 432 

Wechsler, D., 257, 259, 432 

Wechsler Adult Intelligence Scale, 
257-258, 432 

Wechsler Intelligence Scale for 
Children, 259-262, 432 

Weighting of items, on essay tests, 
144-145 

on objective tests, 141 

Wesman, A. G., 233, 287, 288, 429 

Wing, H. D., 296, 432 

Wing Standardized Tests of Musical 
Intelligence, 296-297, 432 

Wood, Dorothy A., 165, 185 

Woodworth, R. S., 346, 432 

Word knowledge, measurement of, 
188-189 


Zaccaria, M. A., 280, 430 
Zimmerman, W. S., 350, 430 


